You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Levani Kokhreidze <le...@gmail.com> on 2019/10/16 18:25:55 UTC

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Hello all,

Sorry for bringing this thread again, but I would like to get some attention on this PR: https://github.com/apache/kafka/pull/7170 <https://github.com/apache/kafka/pull/7170> 
It's been a while now and I would love to move on to other KIPs as well. Please let me know if you have any concerns.

Regards,
Levani


> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <le...@gmail.com> wrote:
> 
> Hi all,
> 
> Here’s voting thread for this KIP: https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
> 
> Regards,
> Levani
> 
>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <levani.codes@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi Matthias,
>> 
>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>> Agree that avoiding unnecessary method overloads is a good idea.
>> 
>> Updated KIP
>> 
>> Regards,
>> Levani
>> 
>> 
>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io <ma...@confluent.io>> wrote:
>>> 
>>> One question:
>>> 
>>> Why do we add
>>> 
>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>> 
>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>> combination with `withName()` and `withNumberOfPartitions()` should be
>>> sufficient. Users can chain the method calls.
>>> 
>>> (I think it's valuable to keep the number of overload small if possible.)
>>> 
>>> Otherwise LGTM.
>>> 
>>> 
>>> -Matthias
>>> 
>>> 
>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>> Hello,
>>>> 
>>>> Thanks all for your feedback.
>>>> I started voting procedure for this KIP. If there’re any other concerns about this KIP, please let me know.
>>>> 
>>>> Regards,
>>>> Levani
>>>> 
>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <levani.codes@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> Hi Matthias,
>>>>> 
>>>>> Thanks for the suggestion, makes sense.
>>>>> I’ve updated KIP (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>>).
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>> 
>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <matthias@confluent.io <ma...@confluent.io> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>> 
>>>>>> Thanks for driving the KIP.
>>>>>> 
>>>>>> I agree that users need to be able to specify a partitioning strategy.
>>>>>> 
>>>>>> Sophie raises a fair point about topic configs and producer configs. My
>>>>>> take is, that consider `Repartitioned` as an "extension" to `Produced`,
>>>>>> that adds topic configuration, is a good way to think about it and helps
>>>>>> to keep the API "clean".
>>>>>> 
>>>>>> 
>>>>>> With regard to method names. I would prefer to avoid abbreviations. Can
>>>>>> we rename:
>>>>>> 
>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>> 
>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>> 
>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>> 
>>>>>> 
>>>>>> -Matthias
>>>>>> 
>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>> Totally agree. I think in KStream interface it makes sense to have some duplicate configurations between operators in order to keep API simple and usable.
>>>>>>> Also, as more surface API has, harder it is to have proper backward compatibility.
>>>>>>> While initial idea of keeping topic level configs separate was exciting, having Repartitioned class encapsulate some producer level configs makes API more readable.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>> 
>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>> configurations and (repartition) topic level considerations separate.
>>>>>>>> Number of partitions is definitely purely a topic level configuration. But
>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>> configuration as a producer one. You could have two producers configured
>>>>>>>> with different serdes and/or partitioners, but if they are writing to the
>>>>>>>> same topic the result would be very difficult to part. So in a sense, these
>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>> 
>>>>>>>> Another way to think of it: while the Streams API is not always true to
>>>>>>>> this, ideally all the relevant configs for an operator are wrapped into a
>>>>>>>> single object (in this case, Repartitioned). We could instead split out the
>>>>>>>> fields in common with Produced into a separate parameter to keep topic and
>>>>>>>> producer level configurations separate, but this increases the API surface
>>>>>>>> area by a lot. It's much more straightforward to just say "this is
>>>>>>>> everything that this particular operator needs" without worrying about what
>>>>>>>> exactly you're specifying.
>>>>>>>> 
>>>>>>>> I suppose you could alternatively make Produced a field of Repartitioned,
>>>>>>>> but I don't think we do this kind of composition elsewhere in Streams at
>>>>>>>> the moment
>>>>>>>> 
>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Bill,
>>>>>>>>> 
>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>> Yes, that makes sense. I’ve updated KIP with `Repartitioned#partitioner`
>>>>>>>>> configuration.
>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>> configuration and keep topic level and producer level configurations (such
>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>> But while looking at the semantics of KStream interface, I couldn’t really
>>>>>>>>> figure out good operation name for Topic level configuration class and just
>>>>>>>>> introducing `Topic` config class was kinda breaking the semantics.
>>>>>>>>> So I think having Repartitioned class which encapsulates topic and
>>>>>>>>> producer level configurations for internal topics is viable thing to do.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com <ma...@gmail.com> <mailto:bbejeck@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Lavani,
>>>>>>>>>> 
>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>> 
>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to the reason
>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>> 
>>>>>>>>>> 1. Users may want to use something other than hash-based partitioning
>>>>>>>>>> 2. Users may wish to partition on something different than the key
>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>  1. A combination of fields in the value in conjunction with the key
>>>>>>>>>>  2. Something other than the key
>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence in
>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API consistency.
>>>>>>>>>> 
>>>>>>>>>> Just my  2 cents.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Bill
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi John,
>>>>>>>>>>> 
>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>> If we add partitioner configuration to Repartitioned class, with the
>>>>>>>>>>> combination of specifying number of partitions for internal topics, user
>>>>>>>>>>> will have opportunity to ensure co-partitioning before join operation.
>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally sure it
>>>>>>>>>>>> makes sense, but I believe something similar is the rationale for
>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>> Do I understand this correctly, in your example I would first issue
>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that essentially
>>>>>>>>> would
>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>> KStream#join
>>>>>>>>>>> with DSL?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io <ma...@confluent.io>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may get
>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the right-side
>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the same
>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something important?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not sure
>>>>>>>>> about
>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would need control
>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best how to
>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for topics
>>>>>>>>> that
>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to specify
>>>>>>>>> the
>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess specifying the
>>>>>>>>>>> partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned would be a
>>>>>>>>>>> useful
>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to Produced? I'm not
>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should be
>>>>>>>>> available
>>>>>>>>>>> for
>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over the topic
>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that is missing
>>>>>>>>> in
>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by configuring
>>>>>>>>> num
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me -- for the
>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, will the
>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case for the
>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be pre-created? The
>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method will
>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns about this KIP.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated confluence
>>>>>>>>> page
>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic` class for
>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>> configuration for internal operators like `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this as well.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing
>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed already (can't
>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to refresh the
>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to the
>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and operators that use
>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might indicate that
>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses use-defined
>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an operator that
>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is one
>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but sometimes
>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction clearer.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering
>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding it is
>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of others on
>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow to
>>>>>>>>> configure
>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we assume that
>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method instead of
>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other configs, like
>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the need to
>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like them
>>>>>>>>> myself.
>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe enhancing Produced
>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach. Let me
>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this configuration,
>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is the final
>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable assumption that user
>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that case, having
>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API contract, seems
>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is explicitly for
>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all are producer
>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level configurations
>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like Produced
>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g topics created
>>>>>>>>> and
>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we should leave
>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we should
>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should create and
>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that should be
>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions configuration in
>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level configuration in
>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in KStream#through - I
>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users possibility to
>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :) Considering arguments
>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two options
>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto
>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with specified num of
>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce new method
>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this in one of
>>>>>>>>> the
>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the following
>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to KStream#through
>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of
>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto
>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V>
>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with specified num of
>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as
>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for creating and
>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already discussed in the
>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were about this
>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around KIP-221. Going
>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement around
>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I understood, the
>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through method
>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>> configuration
>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control
>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks a lot Levani!

On 11/16/19 4:00 AM, Levani Kokhreidze wrote:
> Matthias,
> 
> Yes, I agree. KIP is updated: https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint> and follow-up JIRA ticket is linked in “Rejected Alternatives” section.
> 
> Thank you all for an interesting discussion.
> 
> Kind Regards,
> Levani
> 
>> On Nov 16, 2019, at 10:11 AM, Matthias J. Sax <ma...@confluent.io> wrote:
>>
>> Levani,
>>
>> do you agree to the current proposal? It's basically a de-scoping of the
>> already voted KIP. If you agree, could you update the KIP wiki page
>> accordingly, including the "Rejected Alternative" section (and mabye a
>> link to a follow up Jira ticket).
>>
>> Because it's a descope, and John and myself support it, there seems to
>> be no need to re-vote.
>>
>> @Sophie,John: thanks a lot for your thoughtful input!
>>
>>
>> -Matthias
>>
>> On 11/15/19 12:47 PM, John Roesler wrote:
>>> Thanks Sophie,
>>>
>>> I think your concern is valid, and also that your idea to make a
>>> ticket is a good idea.
>>>
>>> Creating a ticket has some very positive effects:
>>> * It allows us to record the thinking at this point in time so we
>>> don't have to dig through the mail archives later
>>> * It demonstrates that we did consider the use case, and do want to
>>> address it, but just don't feel confident to implement it right now.
>>> Then, if/when people do have a problem with the gap, the ticket it
>>> already there for them to consider, request, or even pick up.
>>>
>>> Since one aspect of the deferral is a desire to wait for real use
>>> experience, we should explicitly mention that in the ticket. This is
>>> just good information for people browsing the Jira looking for
>>> interesting tickets to pick up. They could still pick it up, but they
>>> can ask themselves if they really understand the real-world use cases
>>> any better than we do right now.
>>>
>>> Thanks, likewise, to you for the good discussion!
>>> -John
>>>
>>> On Fri, Nov 15, 2019 at 2:37 PM Sophie Blee-Goldman <so...@confluent.io> wrote:
>>>>
>>>> While I'm concerned that "not augmenting groupBy as part of this KIP"
>>>> really translates to "will not get around to augmenting groupBy for a long
>>>> time if not as part of this KIP", like I said I don't want to hold up the
>>>> new
>>>> .repartition operator that it seems we do, at least, all agree on. It's a
>>>> fair
>>>> point that we can always add this in later, but undoing it is far more
>>>> problematic.
>>>>
>>>> Anyways, I would be happy if we at least make a ticket to consider adding a
>>>> "number of partitions" option/suggestion to groupBy, so that we don't lose
>>>> all the thought put in to this decision so far and can avoid rehashing the
>>>> same
>>>> argument word for word and have something to point to when someone
>>>> asks "why didn't we add this numPartitions option to groupBy".
>>>>
>>>> Beyond that, if the community isn't pushing for it at this moment then it
>>>> seems very
>>>> reasonable to shelve the idea for now so that the rest of this KIP can
>>>> proceed.
>>>> Without input one way or another it's hard to say what the right thing to
>>>> do is,
>>>> which makes the right thing to do "wait to add this feature"
>>>>
>>>> Thanks for the good discussion everyone,
>>>>
>>>> Sophie
>>>>
>>>> On Fri, Nov 15, 2019 at 12:41 PM John Roesler <jo...@confluent.io> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I think that Sophie is asking a good question, and I do think that
>>>>> such "blanket configurations" are plausible. For example, we currently
>>>>> support (and I would encourage) "I don't know if this is going to
>>>>> create a repartition topic, but if it does, then use this name instead
>>>>> of generating one".
>>>>>
>>>>> I'm not sure I'm convinced that specifying max parallelism falls into
>>>>> this category. After all, the groupByKey+aggregate will be executed
>>>>> with _some_ max parallelism. It's either the same as the inputs'
>>>>> partition count or overridden with the proposed config. It seems
>>>>> counterintuitive to override the specified option with the default
>>>>> value.
>>>>>
>>>>> I'm not sure if I can put my finger on it, but "maybe use this name"
>>>>> seems way more reasonable to me than "maybe execute with this degree
>>>>> of parallelism".
>>>>>
>>>>> I do think (and I appreciate that this is where Sophie's example is
>>>>> coming from) that Streams should strive to be absolutely as simple and
>>>>> intuitive as possible (while still maintaining correctness). Optimal
>>>>> performance can be at odds with API simplicity. For example, the
>>>>> simplest behavior is, if you ask for 5 partitions, you get 5
>>>>> partitions. Maybe a repartition is technically not necessary (if you
>>>>> didn't change the key), but at least there's no mystery to this
>>>>> behavior.
>>>>>
>>>>> Clearly, an (opposing) tenent of simplicity is trying to prevent
>>>>> people from making mistakes, which I think is what the example boils
>>>>> down to. Sometimes, we can prevent clear mistakes, like equi-joining
>>>>> two topics with different partition counts. But for this case, it
>>>>> doesn't seem as clear-cut to be able to assume that they _said_ 5
>>>>> partitions, but they didn't really _want_ 5 partitions. Maybe we can
>>>>> just try to be clear in the documentation, and also even log a warning
>>>>> when we parse the topology, "hey, I've been asked to repartition this
>>>>> stream, but it's not necessary".
>>>>>
>>>>> If anything, this discussion really supports to me the value in just
>>>>> sticking with `repartition()` for now, and deferring
>>>>> `groupBy[Key](partitions)` to the future.
>>>>>
>>>>>> Users should not have to choose between allowing Streams to optimize the
>>>>> repartition placement, and allowing to specify a number of partitions.
>>>>>
>>>>> This is a very fair point, and it may be something that we rapidly
>>>>> return to, but it seems safe for now to introduce the non-optimizable
>>>>> `reparition()` only, and then consider optimization options later.
>>>>> Skipping available optimizations will never break correctness, but
>>>>> adding optimizations can, so it makes sense to treat them with
>>>>> caution.
>>>>>
>>>>> In conclusion, I do think that a use _could_ want to "maybe specify"
>>>>> the partition count, but I also think we can afford to pass on
>>>>> supporting this right now.
>>>>>
>>>>> I'm open to continuing the discussion, but just to avoid ambiguity, I
>>>>> still feel we should _not_ change the groupBy[Key] operation at all,
>>>>> and we should only add `repartition()` as a non-optimizable operation.
>>>>>
>>>>> Thanks all,
>>>>> -John
>>>>>
>>>>> On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
>>>>> <le...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Just fyi, PR was updated and now it incorporates the latest suggestions
>>>>> about joins.
>>>>>> `CopartitionedTopicsEnforcer` will throw an exception if number of
>>>>> partitions aren’t the same when using `repartition` operation along with
>>>>> `join`.
>>>>>>
>>>>>> For more details please take a look at the PR:
>>>>> https://github.com/apache/kafka/pull/7170/files <
>>>>> https://github.com/apache/kafka/pull/7170/files>
>>>>>>
>>>>>> Regards,
>>>>>> Levani
>>>>>>
>>>>>>
>>>>>>> On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io>
>>>>> wrote:
>>>>>>>
>>>>>>> Thanks a lot for the input Sophie.
>>>>>>>
>>>>>>> Your example is quite useful, and I would use it to support my claim
>>>>>>> that a "partition hint" for `Grouped` seems "useless" and does not
>>>>>>> improve the user experience.
>>>>>>>
>>>>>>> 1) You argue that a new user would be worries about repartitions topics
>>>>>>> with too many paritions. This would imply that a user is already
>>>>>>> advanced enough to understand the implication of repartitioning -- for
>>>>>>> this case, I would argue that a user also understand _when_ a
>>>>>>> auto-repartitioning would happen and thus the users understands where
>>>>> to
>>>>>>> insert a `repartition()` operation.
>>>>>>>
>>>>>>> 2) For specifying Serdes: if a `groupByKey()` does not trigger
>>>>>>> auto-repartitioning it's not required to specify the serdes and if they
>>>>>>> are specified they would be ignored/unused (note, that `groupBy()`
>>>>> would
>>>>>>> always trigger a repartitioning). Of course, if the default Serdes from
>>>>>>> the config match (eg, all data types are Json anyway), a user does not
>>>>>>> need to worry about specifying serdes. -- For new user that play
>>>>> around,
>>>>>>> I would assume that they work a lot with primitive types and thus would
>>>>>>> need to specify the serdes -- hence, they would learn about
>>>>>>> auto-repartitioning the hard way anyhow, because each time a
>>>>>>> `groupByKey()` does trigger auto-repartioning, they would need to pass
>>>>>>> in the correct Serdes -- this way, they would also be educated where to
>>>>>>> insert a `repartition()` operator if needed.
>>>>>>>
>>>>>>> 3) If a new user really just "plays around", I don't think they use an
>>>>>>> input topic with 100 partitions but most likely have a local single
>>>>> node
>>>>>>> broker with most likely single partitions topics.
>>>>>>>
>>>>>>>
>>>>>>> My main argument for my current proposal is however, that---based on
>>>>>>> past experience---it's better to roll out a new feature more carefully
>>>>>>> and see how it goes. Last, as John pointed out, we can still extend the
>>>>>>> feature in the future. Instead of making a judgment call up-front,
>>>>> being
>>>>>>> more conservative and less fancy, and revisit the design based on
>>>>>>> actuall user feedback after the first version is rolled out, seems to
>>>>> be
>>>>>>> the better option. Undoing a feature is must harder than extending it.
>>>>>>>
>>>>>>>
>>>>>>> While I advocate strong for a simple first version of this feature,
>>>>> it's
>>>>>>> a community decission in the end, and I would not block this KIP if
>>>>>>> there is a broad preference to add `Grouped#withNumberOfPartitions()`
>>>>>>> either.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
>>>>>>>> It seems like we all agree at this point (please correct me if
>>>>> wrong!) that
>>>>>>>> we should NOT change
>>>>>>>> the existing repartitioning behavior, ie we should allow Streams to
>>>>>>>> continue to determine when and
>>>>>>>> where to repartition -- *unless* explicitly informed to by the use of
>>>>> a
>>>>>>>> .through or the new .repartition operator.
>>>>>>>>
>>>>>>>> Regarding groupBy, the existing behavior we should not disrupt is
>>>>>>>> a) repartition *only* when required due to upstream key-changing
>>>>> operation
>>>>>>>> (ie don't force repartitioning
>>>>>>>> based on the presence of an optional config parameter), and
>>>>>>>> b) allow optimization of required repartitions, if any
>>>>>>>>
>>>>>>>> Within the constraint of not breaking the existing behavior, this
>>>>> still
>>>>>>>> leaves open the question of whether we
>>>>>>>> want to improve the user experience by allowing to provide groupBy
>>>>> with a
>>>>>>>> *suggestion* for numPartitions (or to
>>>>>>>> put it more fairly, whether that *will* improve the experience). I
>>>>> agree
>>>>>>>> with many of the arguments outlined above but
>>>>>>>> let me just push back on this one issue one final time, and if we
>>>>> can't
>>>>>>>> come to a consensus then I am happy to drop
>>>>>>>> it for now so that the KIP can proceed.
>>>>>>>>
>>>>>>>> Specifically, my proposal would be to simply augment Grouped with an
>>>>>>>> optional numPartitions, understood to
>>>>>>>> indicate the user's desired number of partitions *if Streams decides
>>>>> to
>>>>>>>> repartition due to that groupBy*
>>>>>>>>
>>>>>>>>> if a user cares about the number of partition, the user wants to
>>>>> enforce
>>>>>>>> a repartitioning
>>>>>>>> First, I think we should take a step back and examine this claim. I
>>>>> agree
>>>>>>>> 100% that *if this is true,*
>>>>>>>> *then we should not give groupBy an optional numPartitions.* As far
>>>>> as I
>>>>>>>> see it, there's no argument
>>>>>>>> to be had there if we *presuppose that claim.* But I'm not convinced
>>>>> in
>>>>>>>> that as an axiom of the user
>>>>>>>> experience and think we should be examining that claim itself, not the
>>>>>>>> consequences of it.
>>>>>>>>
>>>>>>>> To give a simple example, let's say some new user is trying out
>>>>> Streams and
>>>>>>>> wants to just play around
>>>>>>>> with it to see if it might be worth looking into. They want to just
>>>>> write
>>>>>>>> up a simple app and test it out on the
>>>>>>>> data in some existing topics they have with a large number of
>>>>> partitions,
>>>>>>>> and a lot of data. They're just messing
>>>>>>>> around, trying new topologies and don't want to go through each new
>>>>> one
>>>>>>>> step by step to determine if (or where)
>>>>>>>> a repartition might be required. They also don't want to force a
>>>>>>>> repartition if it turns out to not be required, so they'd
>>>>>>>> like to avoid the nice new .repartition operator they saw. But given
>>>>> the
>>>>>>>> huge number of input partitions, they'd like
>>>>>>>> to rest assured that if a repartition does end up being required
>>>>> somewhere
>>>>>>>> during dev, it will not be created with
>>>>>>>> the same huge number of partitions that their input topic has -- so
>>>>> they
>>>>>>>> just pass groupBy a small numPartitions
>>>>>>>> suggestion.
>>>>>>>>
>>>>>>>> I know that's a bit of a contrived example but I think it does
>>>>> highlight
>>>>>>>> how and when this might be a considerable
>>>>>>>> quality of life improvement, in particular for new users to Streams
>>>>> and/or
>>>>>>>> during the dev cycle. *You don't want to*
>>>>>>>> *force a repartition if it wasn't necessary, but you don't want to
>>>>> create a
>>>>>>>> topic with a huge partition count either.*
>>>>>>>>
>>>>>>>> Also, while the optimization discussion took us down an interesting
>>>>> but
>>>>>>>> ultimately more distracting road, it's worth
>>>>>>>> pointing out that it is clearly a major win to have as few
>>>>>>>> repartition topics/steps as possible. Given that we
>>>>>>>> don't want to change existing behavior, the optimization framework
>>>>> can only
>>>>>>>> help out when the placement of
>>>>>>>> repartition steps is flexible, which means only those from .groupBy
>>>>> (and
>>>>>>>> not .repartition). *Users should not*
>>>>>>>> *have to choose between allowing Streams to optimize the repartition
>>>>>>>> placement, and allowing to specify a *
>>>>>>>> *number of partitions.*
>>>>>>>>
>>>>>>>> Lastly, I have what may be a stupid question but for my own
>>>>> edification of
>>>>>>>> how groupBy works:
>>>>>>>> if you do a .groupBy and a repartition is NOT required, does it ever
>>>>> need
>>>>>>>> to serialize/deserialize
>>>>>>>> any of the data? In other words, if you pass a key/value serde to
>>>>> groupBy
>>>>>>>> and it doesn't trigger
>>>>>>>> a repartition, is the serde(s) just ignored and thus more like a
>>>>> suggestion
>>>>>>>> than a requirement?
>>>>>>>>
>>>>>>>> So again, I don't want to hold up this KIP forever but I feel we've
>>>>> spent
>>>>>>>> some time getting slightly
>>>>>>>> off track (although certainly into very interesting discussions) yet
>>>>> never
>>>>>>>> really addressed or questioned
>>>>>>>> the basic premise: *could a user want to specify a number of
>>>>> partitions but
>>>>>>>> not enforce a repartition (at that*
>>>>>>>> *specific point in the topology)?*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <
>>>>> matthias@confluent.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Side remark:
>>>>>>>>>
>>>>>>>>> If the user specifies `repartition()` on both side of the join, we
>>>>> can
>>>>>>>>> actually throw the execption earlier, ie, when we build the topology.
>>>>>>>>>
>>>>>>>>> Current, we can do this check only after Kafka Streams was started,
>>>>>>>>> within `StreamPartitionAssignor#assign()` -- we still need to keep
>>>>> this
>>>>>>>>> check for the case that none or only one side has a user specified
>>>>>>>>> number of partitions though.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>> On 11/14/19 8:15 AM, John Roesler wrote:
>>>>>>>>>> Thanks, all,
>>>>>>>>>>
>>>>>>>>>> I can get behind just totally leaving out reparation-via-groupBy. If
>>>>>>>>>> we only introduce `repartition()` for now, we're making the minimal
>>>>>>>>>> change to gain the desired capability.
>>>>>>>>>>
>>>>>>>>>> Plus, since we agree that `repartition()` should never be
>>>>> optimizable,
>>>>>>>>>> it's a future-compatible proposal. I.e., if we were to add a
>>>>>>>>>> non-optimizable groupBy(partitions) operation now, and want to make
>>>>> it
>>>>>>>>>> optimizable in the future, we have to worry about topology
>>>>>>>>>> compatibility. Better to just do non-optimizable `repartition()`
>>>>> now,
>>>>>>>>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
>>>>>>>>>>
>>>>>>>>>> About joins, yes, it's a concern, and IMO we should just do the same
>>>>>>>>>> thing we do now... check at runtime that the partition counts on
>>>>> both
>>>>>>>>>> sides match and throw an exception otherwise. What this means as a
>>>>>>>>>> user is that if you explicitly repartition the left side to 100
>>>>>>>>>> partitions, and then join with the right side at 10 partitions, you
>>>>>>>>>> get an exception, since this operation is not possible. You'd either
>>>>>>>>>> have to "step down" the left side again, back to 10 partitions, or
>>>>> you
>>>>>>>>>> could repartition the right side to 100 partitions before the join.
>>>>>>>>>> The choice has to be the user's, since it depends on their desired
>>>>>>>>>> execution parallelism.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <
>>>>> matthias@confluent.io>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot John. I think the way you decompose the operators is
>>>>> super
>>>>>>>>>>> helpful for this discussion.
>>>>>>>>>>>
>>>>>>>>>>> What you suggest with regard to using `Grouped` and enforcing
>>>>>>>>>>> repartitioning if the number of partitions is specified is
>>>>> certainly
>>>>>>>>>>> possible. However, I am not sure if we _should_ do this. My
>>>>> reasoning is
>>>>>>>>>>> that an enforce repartitioning as introduced via `repartition()`
>>>>> is an
>>>>>>>>>>> expensive operations, and it seems better to demand an more
>>>>> explicit
>>>>>>>>>>> user opt-in to trigger it. Just setting an optional parameter
>>>>> might be
>>>>>>>>>>> too subtle to trigger such a heavy "side effect".
>>>>>>>>>>>
>>>>>>>>>>> While I agree about "usability" in general, I would prefer a more
>>>>>>>>>>> conservative appraoch to introduce this feature, see how it goes,
>>>>> and
>>>>>>>>>>> maybe make it more advance later on. This also applies to what
>>>>>>>>>>> optimzation we may or may not allow (or are able to perform at
>>>>> all).
>>>>>>>>>>>
>>>>>>>>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
>>>>>>>>>>> Grouped`, I agree that it might not be a good idea.
>>>>>>>>>>>
>>>>>>>>>>> Atm, I see an enforces repartitioning as non-optimizable and as a
>>>>> good
>>>>>>>>>>> first step and I would suggest to not intoruce anything else for
>>>>> now.
>>>>>>>>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
>>>>>>>>>>> Grouped)` is something we could add later.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Therefore, I would not change `Grouped` but only introduce
>>>>>>>>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in
>>>>> to
>>>>>>>>>>> set the number of partitions, would need to rewrite their code to
>>>>>>>>>>> `selectKey(...).repartition(...).groupByKey()`. It's less
>>>>> convinient but
>>>>>>>>>>> also less risky from an API and optimization point of view.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> @Levani: about joins -> yes, we will need to check the specified
>>>>> number
>>>>>>>>>>> of partitions (if any) and if they don't match, throw an
>>>>> exception. We
>>>>>>>>>>> can discuss this on the PR -- I am just trying to get the PR for
>>>>> KIP-466
>>>>>>>>>>> merged -- your is next on the list :)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>> Thank you all for an interesting discussion. This is very
>>>>> enlightening.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you Matthias for your explanation. Your arguments are very
>>>>> true.
>>>>>>>>> It makes sense that if user specifies number of partitions he/she
>>>>> really
>>>>>>>>> cares that those specifications are applied to internal topics.
>>>>>>>>>>>> Unfortunately, in current implementation this is not true during
>>>>>>>>> `join` operation. As I’ve written in the PR comment, currently, when
>>>>>>>>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max
>>>>> number of
>>>>>>>>> partitions from the two source topics.
>>>>>>>>>>>> I’m not really sure what would be the other way around this
>>>>> situation.
>>>>>>>>> Maybe fail the stream altogether and inform the user to specify same
>>>>> number
>>>>>>>>> of partitions?
>>>>>>>>>>>> Or we should treat join operations in a same way as it is right
>>>>> now
>>>>>>>>> and basically choose max number of partitions even when `repartition`
>>>>>>>>> operation is specified, because Kafka Streams “knows the best” how to
>>>>>>>>> handle joins?
>>>>>>>>>>>> You can check integration tests how it’s being handled currently.
>>>>> Open
>>>>>>>>> to suggestions on that part.
>>>>>>>>>>>>
>>>>>>>>>>>> As for groupBy, I agree and John raised very interesting points.
>>>>> My
>>>>>>>>> arguments for allowing users to specify number of partitions during
>>>>> groupBy
>>>>>>>>> operations mainly was coming from the usability perspective.
>>>>>>>>>>>> So building on top of what John said, maybe it makes sense to make
>>>>>>>>> `groupBy` operations smarter and whenever user specifies
>>>>>>>>> `numberOfPartitions` configuration, repartitioning will be enforced,
>>>>> wdyt?
>>>>>>>>>>>> I’m not going into optimization part yet :) I think it will be
>>>>> part of
>>>>>>>>> separate PR and task, but overall it makes sense to apply
>>>>> optimizations
>>>>>>>>> where number of partitions are the same.
>>>>>>>>>>>>
>>>>>>>>>>>> As for Repartitioned extending Grouped, I kinda feel that it
>>>>> won’t fit
>>>>>>>>> nicely in current API design.
>>>>>>>>>>>> In addition, in the PR review, John mentioned that there were a
>>>>> lot of
>>>>>>>>> troubles in the past trying to use one operation's configuration
>>>>> objects on
>>>>>>>>> other operations.
>>>>>>>>>>>> Also it makes sense to keep them separate in terms of
>>>>> compatibility.
>>>>>>>>>>>> In that case, we don’t have to worry every time Grouped is
>>>>> changed,
>>>>>>>>> what would be the implications on `repartition` operations.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io>
>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, thanks for the clarification. I missed your point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like the framework you've presented. It does seem simpler to
>>>>> assume
>>>>>>>>>>>>> that they either care about the partition count and want to
>>>>>>>>>>>>> repartition to realize it, or they don't care about the number.
>>>>>>>>>>>>> Returning to this discussion, it does seem unlikely that they
>>>>> care
>>>>>>>>>>>>> about the number and _don't_ care if it actually gets realized.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But then, it still seems like we can just keep the option as
>>>>> part of
>>>>>>>>>>>>> Grouped. As in:
>>>>>>>>>>>>>
>>>>>>>>>>>>> // user does not care
>>>>>>>>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>>>>>>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>>>>>>>>>>
>>>>>>>>>>>>> // user does care
>>>>>>>>>>>>> stream.repartition(Repartitioned)
>>>>>>>>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>>>>>>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above discussion got me thinking about algebra. Matthias is
>>>>>>>>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent
>>>>> to
>>>>>>>>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not
>>>>> convinced that
>>>>>>>>>>>>> we should force people to apply that expansion themselves vs.
>>>>> having a
>>>>>>>>>>>>> more compact way to express it if they don't care where exactly
>>>>> the
>>>>>>>>>>>>> repartition occurs. However, thinking about these operators
>>>>>>>>>>>>> algebraically can really help *us* narrow down the number of
>>>>> different
>>>>>>>>>>>>> expressions we have to consider.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's consider some identities:
>>>>>>>>>>>>>
>>>>>>>>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>>>>>>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>>>>>>>>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>>>>>>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>>>>>>>>>>> D: repartition = sink(managed) + src
>>>>>>>>>>>>>
>>>>>>>>>>>>> In these identities, I used one special identifier (...), which
>>>>> means
>>>>>>>>>>>>> any number (0+) of operations that are not src, mapKey,
>>>>> groupBy[Key],
>>>>>>>>>>>>> repartition, or agg.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>>>>>>>>>>> operations are not executable. In other words, you have to get
>>>>> to a
>>>>>>>>>>>>> point where you can apply B to convert a groupByKey into a
>>>>> passthough
>>>>>>>>>>>>> in order to execute the program. This is just a formal way of
>>>>> stating
>>>>>>>>>>>>> what already happens in Kafka Streams.
>>>>>>>>>>>>>
>>>>>>>>>>>>> By applying A, we can just completely leave `groupBy` out of our
>>>>>>>>>>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>>>>>>>>>>> groupByKey.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then, we can eliminate the "repartition required" case of
>>>>> `groupByKey`
>>>>>>>>>>>>> by applying C followed by D to get to the "no repartition
>>>>> required"
>>>>>>>>>>>>> version of groupByKey, which in turn sets us up to apply B to
>>>>> get an
>>>>>>>>>>>>> executable topology.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fundamentally, you can think about KIP-221 is as proposing a
>>>>> modified
>>>>>>>>>>>>> D identity in which you can specify the partition count of the
>>>>> managed
>>>>>>>>>>>>> sink topic:
>>>>>>>>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since users _could_ apply the identities above, we don't
>>>>> actually have
>>>>>>>>>>>>> to add any partition count to groupBy[Key], but we decided early
>>>>> on in
>>>>>>>>>>>>> the KIP discussion that it's more ergonomic to add it. In that
>>>>> case,
>>>>>>>>>>>>> we also have to modify A and C:
>>>>>>>>>>>>> A': groupBy(mapper, pc) + agg
>>>>>>>>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
>>>>>>>>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>>>>>>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which sets us up still to always be able to get back to a plain
>>>>>>>>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>>>>>>>>>>> ultimately B to get an executable topology.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What about the optimizer?
>>>>>>>>>>>>> The optimizer applies another set of graph-algebraic identities
>>>>> to
>>>>>>>>>>>>> minimize the number of repartition topics in a topology.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (forgive my ascii art)
>>>>>>>>>>>>>
>>>>>>>>>>>>> E: (merging repartition nodes)
>>>>>>>>>>>>> (...) -> repartition -> X
>>>>>>>>>>>>> \-> repartition -> Y
>>>>>>>>>>>>> =
>>>>>>>>>>>>> (... + repartition) -> X
>>>>>>>>>>>>>   \-> Y
>>>>>>>>>>>>> F: (reordering around repartition)
>>>>>>>>>>>>> Where SVO is any non-key-changing, stateless, operation:
>>>>>>>>>>>>> repartition -> SVO = SVO -> repartition
>>>>>>>>>>>>>
>>>>>>>>>>>>> In terms of these identities, what the optimizer does is apply F
>>>>>>>>>>>>> repeatedly in either direction to a topology to factor out
>>>>> common in
>>>>>>>>>>>>> branches so that it can apply E to merge repartition nodes. This
>>>>> was
>>>>>>>>>>>>> especially necessary before KIP-221 because you couldn't directly
>>>>>>>>>>>>> express `repartition` in the DSL, only indirectly via
>>>>> `groupBy[Key]`,
>>>>>>>>>>>>> so there was no way to do the factoring manually.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We can now state very clearly that in KIP-221, explicit
>>>>>>>>>>>>> `repartition()` operators should create a "reordering barrier".
>>>>> So, F
>>>>>>>>>>>>> cannot be applied to an explicit `repartition()`. Also, I think
>>>>> we
>>>>>>>>>>>>> decided earlier that explicit `repartition()` operations would
>>>>> also be
>>>>>>>>>>>>> ineligible for merging, so E can't be applied to explicit
>>>>>>>>>>>>> `repartition()` operations either. I think we feel we _could_
>>>>> apply E
>>>>>>>>>>>>> without harm, but we want to be conservative for now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the salient point from the latter discussion has been
>>>>> that
>>>>>>>>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_
>>>>> constitute
>>>>>>>>>>>>> an explicit `repartition()` operator, and therefore, the
>>>>> resulting
>>>>>>>>>>>>> repartition node remains eligible for optimization.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To be clear, I agree with Matthias that the provided partition
>>>>> count
>>>>>>>>>>>>> _must_ be used in the resulting implicit repartition. This has
>>>>> some
>>>>>>>>>>>>> implications for E. Namely, E could only be applied to two
>>>>> repartition
>>>>>>>>>>>>> nodes that have the same partition count. This has always been
>>>>>>>>>>>>> trivially true before KIP-221 because the partition count has
>>>>> always
>>>>>>>>>>>>> been "unspecified", i.e., it would be determined at runtime by
>>>>> the
>>>>>>>>>>>>> user-managed-topics' partition counts. Now, it could be
>>>>> specified or
>>>>>>>>>>>>> unspecified. We can simply augment E to allow merging only
>>>>> repartition
>>>>>>>>>>>>> nodes where the partition count is EITHER "specified and the
>>>>> same on
>>>>>>>>>>>>> both sides", OR "unspecified on both sides".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the long email, but I have a hope that it builds a
>>>>> solid
>>>>>>>>>>>>> theoretical foundation for our decisions in KIP-221, so we can
>>>>> have
>>>>>>>>>>>>> confidence that there are no edge cases for design flaws to hide.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
>>>>>>>>> matthias@confluent.io> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> it seems like we do want to allow
>>>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>>>> repartitioning
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Correct, ie, that is my suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I disagree here, because my reasoning is that:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - if a user cares about the number of partition, the user wants
>>>>> to
>>>>>>>>>>>>>> enforce a repartitioning
>>>>>>>>>>>>>> - if a user does not case about the number of partitions, we
>>>>> don't
>>>>>>>>> need
>>>>>>>>>>>>>> to provide them a way to pass in a "hint"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hence, it should be sufficient to support:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // user does not care
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `stream.groupByKey(Grouped)`
>>>>>>>>>>>>>> `stream.grouBy(..., Grouped)`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // user does care
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
>>>>>>>>>>>>>> `streams.groupBy(..., Repartitioned)`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>>>>>>>>>>> Thanks for those thoughts, Matthias,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I find your reasoning about the optimization behavior
>>>>> compelling.
>>>>>>>>> The
>>>>>>>>>>>>>>> `through` operation is very simple and clear to reason about.
>>>>> It
>>>>>>>>> just
>>>>>>>>>>>>>>> passes the data exactly at the specified point in the topology
>>>>>>>>> exactly
>>>>>>>>>>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>>>>>>>>>>> `repartition` operator, the simplest behavior is if we just do
>>>>> what
>>>>>>>>>>>>>>> they asked for.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stepping back to think about when optimizations are surprising
>>>>> and
>>>>>>>>>>>>>>> when they aren't, it occurs to me that we should be free to
>>>>> move
>>>>>>>>>>>>>>> around repartitions when users have asked to perform some
>>>>> operation
>>>>>>>>>>>>>>> that implies a repartition, like "change keys, then filter,
>>>>> then
>>>>>>>>>>>>>>> aggregate". This program requires a repartition, but it could
>>>>> be
>>>>>>>>>>>>>>> anywhere between the key change and the aggregation. On the
>>>>> other
>>>>>>>>>>>>>>> hand, if they say, "change keys, then filter, then
>>>>> repartition, then
>>>>>>>>>>>>>>> aggregate", it seems like they were pretty clear about their
>>>>> desire,
>>>>>>>>>>>>>>> and we should just take it at face value.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, I'm sold on just literally doing a repartition every time
>>>>> they
>>>>>>>>>>>>>>> invoke the `repartition` operator.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is
>>>>> more
>>>>>>>>> nuanced.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What you said about `groupByKey` makes sense. If they key
>>>>> hasn't
>>>>>>>>>>>>>>> actually changed, then we don't need to repartition before
>>>>>>>>>>>>>>> aggregating. On the other hand, `groupBy` is specifically
>>>>> changing
>>>>>>>>> the
>>>>>>>>>>>>>>> key as part of the grouping operation, so (as you said) we
>>>>>>>>> definitely
>>>>>>>>>>>>>>> have to do a repartition.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I'm reading the discussion right, it seems like we do want
>>>>> to
>>>>>>>>> allow
>>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>>>> repartitioning
>>>>>>>>> if
>>>>>>>>>>>>>>> it's not needed. That last clause is the key. "Use P
>>>>> partitions if
>>>>>>>>>>>>>>> repartitioning is necessary" is a directive that applies
>>>>> cleanly and
>>>>>>>>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call
>>>>> the
>>>>>>>>>>>>>>> option `numberOfPartitionsHint`, which along with the "if
>>>>> necessary"
>>>>>>>>>>>>>>> javadoc, should make it clear that the option won't force a
>>>>>>>>>>>>>>> repartition, and also gives us enough latitude to still employ
>>>>> the
>>>>>>>>>>>>>>> optimizer on those repartition topics?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we like the idea of expressing it as a "hint" for grouping
>>>>> and a
>>>>>>>>>>>>>>> "command" for `repartition`, then it seems like it still makes
>>>>> sense
>>>>>>>>>>>>>>> to keep Grouped and Repartitioned separate, as they would
>>>>> actually
>>>>>>>>>>>>>>> offer different methods with distinct semantics.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
>>>>>>>>> matthias@confluent.io> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry for late reply.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>>>>>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using
>>>>> the
>>>>>>>>>>>>>>>> upstream topic to determine the number of partitions)
>>>>>>>>>>>>>>>> - KS does auto-repartitioning only for downstream DSL
>>>>> operators
>>>>>>>>> like
>>>>>>>>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
>>>>>>>>> auto-repartitioning
>>>>>>>>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>>>>>>>>>>> - KS offers `through()` to enforce a repartitioning --
>>>>> however,
>>>>>>>>> the user
>>>>>>>>>>>>>>>> needs to create the topic manually (with the desired number of
>>>>>>>>> partitions).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see two main applications for `repartitioning()`:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) repartition data before a `transform()` but user does not
>>>>> want
>>>>>>>>> to
>>>>>>>>>>>>>>>> manage the topic
>>>>>>>>>>>>>>>> 2) scale out a downstream subtopology
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a
>>>>> users
>>>>>>>>> calls
>>>>>>>>>>>>>>>> it, a repartitining is enforced, with the difference that KS
>>>>>>>>> manages the
>>>>>>>>>>>>>>>> topic and the user does not need to create it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>>>> repartition
>>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of (2), I disagree. Either a user does not care about
>>>>>>>>> scaling
>>>>>>>>>>>>>>>> out, for which case she would not specify the number of
>>>>>>>>> partitions. Or a
>>>>>>>>>>>>>>>> user does care, and hence wants to enforce the scale out. I
>>>>> don't
>>>>>>>>> think
>>>>>>>>>>>>>>>> that any user would say, "maybe scale out".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Therefore, the optimizer should never ignore the repartition
>>>>>>>>> operation.
>>>>>>>>>>>>>>>> As a "consequence" (because repartitioning is expensive) a
>>>>> user
>>>>>>>>> should
>>>>>>>>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking
>>>>> an
>>>>>>>>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be
>>>>> "dangerous"
>>>>>>>>>>>>>>>> because it might be too subtle and an "optional scaling out"
>>>>> as
>>>>>>>>> laid out
>>>>>>>>>>>>>>>> above does not make sense IMHO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am also not worried about "over repartitioning" because the
>>>>>>>>> result
>>>>>>>>>>>>>>>> stream would never trigger auto-repartitioning. Only if
>>>>> multiple
>>>>>>>>>>>>>>>> consecutive calls to `repartition()` are made it could be bad
>>>>> --
>>>>>>>>> but
>>>>>>>>>>>>>>>> that's the same with `through()`. In the end, there is always
>>>>> some
>>>>>>>>>>>>>>>> responsibility on the user.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be
>>>>> required,
>>>>>>>>>>>>>>>> however, for `groupByKey()` it depends if the KStream is
>>>>> marked as
>>>>>>>>>>>>>>>> `repartitioningRequired`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a
>>>>> user to
>>>>>>>>> set
>>>>>>>>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different
>>>>> story,
>>>>>>>>>>>>>>>> because calling
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `repartition().groupBy()`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> does not achieve what we want. Hence, allowing users to pass
>>>>> in the
>>>>>>>>>>>>>>>> number of users partitions into `groupBy()` does actually
>>>>> makes
>>>>>>>>> sense,
>>>>>>>>>>>>>>>> because repartitioning will happen anyway and thus we can
>>>>>>>>> piggyback a
>>>>>>>>>>>>>>>> scaling decision.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think that John has a fair concern about the overloads,
>>>>> however,
>>>>>>>>> I am
>>>>>>>>>>>>>>>> not convinced that using `Grouped` to specify the number of
>>>>>>>>> partitions
>>>>>>>>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned`
>>>>> and
>>>>>>>>> both
>>>>>>>>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I
>>>>> am
>>>>>>>>>>>>>>>> wondering if we could bridge the gap between both, if we
>>>>> would make
>>>>>>>>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>>>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what
>>>>> seems to
>>>>>>>>> make
>>>>>>>>>>>>>>>> the API quite smooth:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `stream.groupBy(..., Grouped...)`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>> Hi Sophie,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
>>>>>>>>> hearing others opinion as well on this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
>>>>>>>>> sophie@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>>>>>>>> other hand
>>>>>>>>>>>>>>>>>> Kafka Streams has already
>>>>>>>>>>>>>>>>>>> optimizer in place which alters topology independently
>>>>> from user
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree (with you) and think this is a good way to put it
>>>>> -- we
>>>>>>>>> currently
>>>>>>>>>>>>>>>>>> auto-repartition for the user so
>>>>>>>>>>>>>>>>>> that they don't have to walk through their entire topology
>>>>> and
>>>>>>>>> reason about
>>>>>>>>>>>>>>>>>> when and where to place a
>>>>>>>>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly
>>>>> force
>>>>>>>>> this onto the
>>>>>>>>>>>>>>>>>> user? How certain are we that
>>>>>>>>>>>>>>>>>> users will always get this right? It's easy to imagine that
>>>>>>>>> during
>>>>>>>>>>>>>>>>>> development, you write your new app with
>>>>>>>>>>>>>>>>>> correctly placed repartitions in order to use this new
>>>>> feature.
>>>>>>>>> During the
>>>>>>>>>>>>>>>>>> course of development you end up
>>>>>>>>>>>>>>>>>> tweaking the topology, but don't remember to review or move
>>>>> the
>>>>>>>>>>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>>>>>>>>>>> doing this for you. If you use only single-partition topics
>>>>> for
>>>>>>>>> testing,
>>>>>>>>>>>>>>>>>> you might not even notice your app is
>>>>>>>>>>>>>>>>>> spitting out incorrect results!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
>>>>>>>>> introduce a new
>>>>>>>>>>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>>>>>>>>>>> advantage of this other feature anymore. Also, is it
>>>>> possible our
>>>>>>>>>>>>>>>>>> optimization framework could ever include an
>>>>>>>>>>>>>>>>>> optimized repartitioning strategy that is better than what a
>>>>>>>>> user could
>>>>>>>>>>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>>>>>>>>>>> Do we expect users to have a deep understanding of the best
>>>>> way
>>>>>>>>> to
>>>>>>>>>>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>>>>>>>>>>> likely they will end up over-repartitioning either due to
>>>>> missed
>>>>>>>>>>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>>>> repartition
>>>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped
>>>>> rather
>>>>>>>>> than
>>>>>>>>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in
>>>>> line
>>>>>>>>> with the
>>>>>>>>>>>>>>>>>> current syntax so +1 from me
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
>>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under
>>>>> review and
>>>>>>>>> it’s
>>>>>>>>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP
>>>>> to
>>>>>>>>> address
>>>>>>>>>>>>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
>>>>>>>>> control over
>>>>>>>>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is
>>>>> done
>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which
>>>>> is
>>>>>>>>> planned to
>>>>>>>>>>>>>>>>>>> be separate PR.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While all agree about general implementation and idea
>>>>> behind
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR,
>>>>> introducing new
>>>>>>>>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
>>>>>>>>> questions during
>>>>>>>>>>>>>>>>>>> the review.
>>>>>>>>>>>>>>>>>>> Matthias raised concern that there can be cases when user
>>>>> uses
>>>>>>>>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
>>>>>>>>> repartitioning may
>>>>>>>>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
>>>>>>>>> would never be
>>>>>>>>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted
>>>>> your
>>>>>>>>> comment).
>>>>>>>>>>>>>>>>>>> So instead, if user wants to control parallelism of
>>>>>>>>> sub-topologies, he or
>>>>>>>>>>>>>>>>>>> she should always use `KStream#repartition` operation
>>>>> before
>>>>>>>>> groupBy. Full
>>>>>>>>>>>>>>>>>>> comment can be seen here:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On the same topic, John pointed out that, from API design
>>>>>>>>> perspective, we
>>>>>>>>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
>>>>>>>>> operators between
>>>>>>>>>>>>>>>>>>> one another. So instead of introducing new
>>>>>>>>> `KStream#groupBy(Repartitioned)`
>>>>>>>>>>>>>>>>>>> for specifying number of partitions for internal topic, we
>>>>>>>>> should update
>>>>>>>>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>>>>>>>> other hand
>>>>>>>>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
>>>>>>>>> topology
>>>>>>>>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
>>>>>>>>> Streams,
>>>>>>>>>>>>>>>>>>> internally would optimize topology in the best way
>>>>> possible,
>>>>>>>>> even if in
>>>>>>>>>>>>>>>>>>> some cases this means ignoring some operator configurations
>>>>>>>>> passed by the
>>>>>>>>>>>>>>>>>>> user. Also, I agree with John about API design semantics.
>>>>> If we
>>>>>>>>> go through
>>>>>>>>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes
>>>>> more
>>>>>>>>> sense to
>>>>>>>>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead
>>>>> of
>>>>>>>>> introducing
>>>>>>>>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
>>>>>>>>> sophie@confluent.io>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
>>>>>>>>> don't have
>>>>>>>>>>>>>>>>>>> much
>>>>>>>>>>>>>>>>>>>> spare time at the
>>>>>>>>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
>>>>>>>>> things, but
>>>>>>>>>>>>>>>>>>> feel
>>>>>>>>>>>>>>>>>>>> free to pick up something else
>>>>>>>>>>>>>>>>>>>> to work on in the meantime until things have calmed down
>>>>> a bit!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like
>>>>> to get
>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>> attention on this PR:
>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>>>>>>>>>>>>> It's been a while now and I would love to move on to
>>>>> other
>>>>>>>>> KIPs as well.
>>>>>>>>>>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>>>>>>>>
>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>>>>>>>>
>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion
>>>>> on
>>>>>>>>> that one.
>>>>>>>>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a
>>>>> good
>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
>>>>> matthias@confluent.io>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
>>>>>>>>> numberOfPartitions)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
>>>>>>>>> `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>>>>>>>>>>> combination with `withName()` and
>>>>>>>>> `withNumberOfPartitions()` should
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
>>>>>>>>> small if
>>>>>>>>>>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re
>>>>> any
>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
>>>>> matthias@confluent.io>>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>> <mailto:
>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
>>>>>>>>> "extension" to
>>>>>>>>>>>>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to
>>>>> think
>>>>>>>>> about it and
>>>>>>>>>>>>>>>>>>>>> helps
>>>>>>>>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to
>>>>> avoid
>>>>>>>>>>>>>>>>>>> abbreviations.
>>>>>>>>>>>>>>>>>>>>> Can
>>>>>>>>>>>>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more
>>>>> `static`
>>>>>>>>> methods:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>>>>>>>>>>> -
>>>>> Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it
>>>>> makes
>>>>>>>>> sense to
>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>> some duplicate configurations between operators in order
>>>>> to
>>>>>>>>> keep API
>>>>>>>>>>>>>>>>>>> simple
>>>>>>>>>>>>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to
>>>>> have
>>>>>>>>> proper
>>>>>>>>>>>>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
>>>>>>>>> separate was
>>>>>>>>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
>>>>>>>>> producer level
>>>>>>>>>>>>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman
>>>>> <
>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
>>>>>>>>> producer level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
>>>>>>>>> considerations
>>>>>>>>>>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just
>>>>> as
>>>>>>>>> much a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have
>>>>> two
>>>>>>>>> producers
>>>>>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
>>>>>>>>> they are
>>>>>>>>>>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
>>>>>>>>> part. So in a
>>>>>>>>>>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
>>>>>>>>> producers.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams
>>>>> API is
>>>>>>>>> not always
>>>>>>>>>>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
>>>>>>>>> operator are
>>>>>>>>>>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We
>>>>> could
>>>>>>>>> instead
>>>>>>>>>>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
>>>>>>>>> parameter to keep
>>>>>>>>>>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
>>>>>>>>> increases the
>>>>>>>>>>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to
>>>>> just
>>>>>>>>> say "this
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
>>>>>>>>> without worrying
>>>>>>>>>>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
>>>>>>>>> field of
>>>>>>>>>>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
>>>>>>>>> elsewhere in
>>>>>>>>>>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani
>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>>>>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class
>>>>> for
>>>>>>>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer
>>>>> level
>>>>>>>>>>>>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in
>>>>> this
>>>>>>>>> thread).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
>>>>>>>>> interface, I
>>>>>>>>>>>>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda
>>>>> breaking
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
>>>>>>>>> encapsulates topic
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal
>>>>> topics is
>>>>>>>>> viable
>>>>>>>>>>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
>>>>>>>>> bbejeck@gmail.com
>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>
>>>>>>>>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.
>>>>> In
>>>>>>>>> addition to
>>>>>>>>>>>>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
>>>>>>>>> hash-based
>>>>>>>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
>>>>>>>>> different than the
>>>>>>>>>>>>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
>>>>>>>>> conjunction with
>>>>>>>>>>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
>>>>>>>>> Produced hence
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes
>>>>> sense
>>>>>>>>> for API
>>>>>>>>>>>>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani
>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
>>>>>>>>> Repartitioned class,
>>>>>>>>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of
>>>>> partitions for
>>>>>>>>> internal
>>>>>>>>>>>>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure
>>>>> co-partitioning
>>>>>>>>> before join
>>>>>>>>>>>>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind.
>>>>> Again,
>>>>>>>>> not totally
>>>>>>>>>>>>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar
>>>>> is
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> rationale
>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani
>>>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your
>>>>> example
>>>>>>>>> I would
>>>>>>>>>>>>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper
>>>>> partitioner
>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with
>>>>> and
>>>>>>>>> then do the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler
>>>>> <
>>>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>> john@confluent.io
>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io
>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> john@confluent.io>
>>>>>>>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an
>>>>> option to
>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is
>>>>> that
>>>>>>>>> some data may
>>>>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
>>>>>>>>> topic. If the
>>>>>>>>>>>>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
>>>>>>>>> strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
>>>>>>>>> partitioned with the
>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
>>>>>>>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To
>>>>> be
>>>>>>>>> honest I’m
>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
>>>>>>>>> think I would
>>>>>>>>>>>>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka
>>>>> Streams
>>>>>>>>> knows best
>>>>>>>>>>>>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should
>>>>> be
>>>>>>>>> used only for
>>>>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet,
>>>>> but I
>>>>>>>>> guess
>>>>>>>>>>>>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
>>>>>>>>> agreement on
>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
>>>>>>>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
>>>>> sophie@confluent.io>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
>>>>>>>>> Repartitioned
>>>>>>>>>>>>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also
>>>>> need
>>>>>>>>> to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field,
>>>>> similar
>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but
>>>>> seems
>>>>>>>>> it should
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
>>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by
>>>>> Kafka
>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user
>>>>> more
>>>>>>>>> control over
>>>>>>>>>>>>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
>>>>>>>>> something that
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
>>>>>>>>> parallelism by
>>>>>>>>>>>>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
>>>>>>>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
>>>>> sophie@confluent.io>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
>>>>>>>>> thing for me
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems
>>>>> to be
>>>>>>>>> the case
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it
>>>>> have to
>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version
>>>>> of
>>>>>>>>> the method
>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
>>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>
>>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing
>>>>> list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
>>>>>>>>> opinions/concerns
>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
>>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
>>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward,
>>>>> I’ve
>>>>>>>>> updated
>>>>>>>>>>>>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected
>>>>> Alternatives”
>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP
>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
>>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
>>>>>>>>> dedicated `Topic`
>>>>>>>>>>>>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators
>>>>> like
>>>>>>>>>>>>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others
>>>>> opinion
>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM,
>>>>> Matthias
>>>>>>>>> J. Sax <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP!
>>>>> And
>>>>>>>>> thanks for
>>>>>>>>>>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
>>>>>>>>> discussed
>>>>>>>>>>>>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a
>>>>> good
>>>>>>>>> summary to
>>>>>>>>>>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes
>>>>> sense.
>>>>>>>>> With regard
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage
>>>>> topics
>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> operators
>>>>>>>>>>>>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument,
>>>>> it
>>>>>>>>> might
>>>>>>>>>>>>>>>>>>> indicate
>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator
>>>>> that
>>>>>>>>> uses
>>>>>>>>>>>>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
>>>>>>>>> operator (an
>>>>>>>>>>>>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
>>>>>>>>> Otherwise, there is
>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
>>>>>>>>> topics but
>>>>>>>>>>>>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make
>>>>> the
>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
>>>>>>>>> `Grouped`. I am
>>>>>>>>>>>>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does
>>>>> apply
>>>>>>>>> and adding
>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
>>>>>>>>> opinions of
>>>>>>>>>>>>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I
>>>>> prefer
>>>>>>>>> atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration
>>>>> objects
>>>>>>>>> that allow
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer,
>>>>> producer,
>>>>>>>>> store. If we
>>>>>>>>>>>>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if
>>>>> we
>>>>>>>>> should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()`
>>>>> class
>>>>>>>>> and method
>>>>>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to
>>>>> add
>>>>>>>>> other
>>>>>>>>>>>>>>>>>>> configs,
>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc,
>>>>> easily,
>>>>>>>>> without the
>>>>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not
>>>>> sure
>>>>>>>>> if I like
>>>>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani
>>>>> Kokhreidze
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
>>>>>>>>> maybe
>>>>>>>>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
>>>>>>>>> best approach.
>>>>>>>>>>>>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class
>>>>> with
>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation.
>>>>> Since
>>>>>>>>> KStream#to is
>>>>>>>>>>>>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be
>>>>> reasonable
>>>>>>>>>>>>>>>>>>> assumption
>>>>>>>>>>>>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in
>>>>> advance. And
>>>>>>>>> in that
>>>>>>>>>>>>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make
>>>>> much
>>>>>>>>> sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class,
>>>>> based
>>>>>>>>> on API
>>>>>>>>>>>>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something
>>>>> that
>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer,
>>>>> partitioner
>>>>>>>>> those all
>>>>>>>>>>>>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions
>>>>> is
>>>>>>>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer
>>>>> level
>>>>>>>>>>>>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
>>>>>>>>> seems like
>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with
>>>>> non-internal
>>>>>>>>> (e.g
>>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics
>>>>> and I
>>>>>>>>> think we
>>>>>>>>>>>>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into
>>>>> account,
>>>>>>>>> I think we
>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
>>>>>>>>> Streams should
>>>>>>>>>>>>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
>>>>>>>>> topics that
>>>>>>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
>>>>>>>>> numPartitions
>>>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and
>>>>> producer
>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name
>>>>> optional
>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and
>>>>> giving
>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more
>>>>> useful :)
>>>>>>>>>>>>>>>>>>> Considering
>>>>>>>>>>>>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
>>>>>>>>> class, I see two
>>>>>>>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>>>>>>>> auto-generated and
>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>>> numOfPartitions)
>>>>>>>>> - topic
>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>>> numOfPartitions,
>>>>>>>>> final
>>>>>>>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
>>>>> generated
>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>>>>>>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it
>>>>> is and
>>>>>>>>> introduce
>>>>>>>>>>>>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
>>>>>>>>> suggested this
>>>>>>>>>>>>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
>>>>>>>>> propose the
>>>>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>>>>>>>> configuration to
>>>>>>>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method
>>>>> overloads to
>>>>>>>>>>>>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>>>>>>>> auto-generated and
>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>>> numOfPartitions)
>>>>>>>>> - topic
>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>>> numOfPartitions,
>>>>>>>>> final
>>>>>>>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
>>>>> generated
>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>>>>>>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>>>>>>>> configuration to
>>>>>>>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
>>>>>>>>> KStream#repartition for
>>>>>>>>>>>>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
>>>>>>>>> already
>>>>>>>>>>>>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the
>>>>> threads
>>>>>>>>> that were
>>>>>>>>>>>>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM,
>>>>> Levani
>>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect
>>>>> discussion
>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems
>>>>> to
>>>>>>>>> agreement
>>>>>>>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation,
>>>>> as far
>>>>>>>>> as I
>>>>>>>>>>>>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
>>>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
>>>>>>>>> numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
>>>>>>>>> users to
>>>>>>>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing
>>>>> stateful
>>>>>>>>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
>>>>>>>>> changes around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>
> 
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Levani Kokhreidze <le...@gmail.com>.
Matthias,

Yes, I agree. KIP is updated: https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint> and follow-up JIRA ticket is linked in “Rejected Alternatives” section.

Thank you all for an interesting discussion.

Kind Regards,
Levani

> On Nov 16, 2019, at 10:11 AM, Matthias J. Sax <ma...@confluent.io> wrote:
> 
> Levani,
> 
> do you agree to the current proposal? It's basically a de-scoping of the
> already voted KIP. If you agree, could you update the KIP wiki page
> accordingly, including the "Rejected Alternative" section (and mabye a
> link to a follow up Jira ticket).
> 
> Because it's a descope, and John and myself support it, there seems to
> be no need to re-vote.
> 
> @Sophie,John: thanks a lot for your thoughtful input!
> 
> 
> -Matthias
> 
> On 11/15/19 12:47 PM, John Roesler wrote:
>> Thanks Sophie,
>> 
>> I think your concern is valid, and also that your idea to make a
>> ticket is a good idea.
>> 
>> Creating a ticket has some very positive effects:
>> * It allows us to record the thinking at this point in time so we
>> don't have to dig through the mail archives later
>> * It demonstrates that we did consider the use case, and do want to
>> address it, but just don't feel confident to implement it right now.
>> Then, if/when people do have a problem with the gap, the ticket it
>> already there for them to consider, request, or even pick up.
>> 
>> Since one aspect of the deferral is a desire to wait for real use
>> experience, we should explicitly mention that in the ticket. This is
>> just good information for people browsing the Jira looking for
>> interesting tickets to pick up. They could still pick it up, but they
>> can ask themselves if they really understand the real-world use cases
>> any better than we do right now.
>> 
>> Thanks, likewise, to you for the good discussion!
>> -John
>> 
>> On Fri, Nov 15, 2019 at 2:37 PM Sophie Blee-Goldman <so...@confluent.io> wrote:
>>> 
>>> While I'm concerned that "not augmenting groupBy as part of this KIP"
>>> really translates to "will not get around to augmenting groupBy for a long
>>> time if not as part of this KIP", like I said I don't want to hold up the
>>> new
>>> .repartition operator that it seems we do, at least, all agree on. It's a
>>> fair
>>> point that we can always add this in later, but undoing it is far more
>>> problematic.
>>> 
>>> Anyways, I would be happy if we at least make a ticket to consider adding a
>>> "number of partitions" option/suggestion to groupBy, so that we don't lose
>>> all the thought put in to this decision so far and can avoid rehashing the
>>> same
>>> argument word for word and have something to point to when someone
>>> asks "why didn't we add this numPartitions option to groupBy".
>>> 
>>> Beyond that, if the community isn't pushing for it at this moment then it
>>> seems very
>>> reasonable to shelve the idea for now so that the rest of this KIP can
>>> proceed.
>>> Without input one way or another it's hard to say what the right thing to
>>> do is,
>>> which makes the right thing to do "wait to add this feature"
>>> 
>>> Thanks for the good discussion everyone,
>>> 
>>> Sophie
>>> 
>>> On Fri, Nov 15, 2019 at 12:41 PM John Roesler <jo...@confluent.io> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I think that Sophie is asking a good question, and I do think that
>>>> such "blanket configurations" are plausible. For example, we currently
>>>> support (and I would encourage) "I don't know if this is going to
>>>> create a repartition topic, but if it does, then use this name instead
>>>> of generating one".
>>>> 
>>>> I'm not sure I'm convinced that specifying max parallelism falls into
>>>> this category. After all, the groupByKey+aggregate will be executed
>>>> with _some_ max parallelism. It's either the same as the inputs'
>>>> partition count or overridden with the proposed config. It seems
>>>> counterintuitive to override the specified option with the default
>>>> value.
>>>> 
>>>> I'm not sure if I can put my finger on it, but "maybe use this name"
>>>> seems way more reasonable to me than "maybe execute with this degree
>>>> of parallelism".
>>>> 
>>>> I do think (and I appreciate that this is where Sophie's example is
>>>> coming from) that Streams should strive to be absolutely as simple and
>>>> intuitive as possible (while still maintaining correctness). Optimal
>>>> performance can be at odds with API simplicity. For example, the
>>>> simplest behavior is, if you ask for 5 partitions, you get 5
>>>> partitions. Maybe a repartition is technically not necessary (if you
>>>> didn't change the key), but at least there's no mystery to this
>>>> behavior.
>>>> 
>>>> Clearly, an (opposing) tenent of simplicity is trying to prevent
>>>> people from making mistakes, which I think is what the example boils
>>>> down to. Sometimes, we can prevent clear mistakes, like equi-joining
>>>> two topics with different partition counts. But for this case, it
>>>> doesn't seem as clear-cut to be able to assume that they _said_ 5
>>>> partitions, but they didn't really _want_ 5 partitions. Maybe we can
>>>> just try to be clear in the documentation, and also even log a warning
>>>> when we parse the topology, "hey, I've been asked to repartition this
>>>> stream, but it's not necessary".
>>>> 
>>>> If anything, this discussion really supports to me the value in just
>>>> sticking with `repartition()` for now, and deferring
>>>> `groupBy[Key](partitions)` to the future.
>>>> 
>>>>> Users should not have to choose between allowing Streams to optimize the
>>>> repartition placement, and allowing to specify a number of partitions.
>>>> 
>>>> This is a very fair point, and it may be something that we rapidly
>>>> return to, but it seems safe for now to introduce the non-optimizable
>>>> `reparition()` only, and then consider optimization options later.
>>>> Skipping available optimizations will never break correctness, but
>>>> adding optimizations can, so it makes sense to treat them with
>>>> caution.
>>>> 
>>>> In conclusion, I do think that a use _could_ want to "maybe specify"
>>>> the partition count, but I also think we can afford to pass on
>>>> supporting this right now.
>>>> 
>>>> I'm open to continuing the discussion, but just to avoid ambiguity, I
>>>> still feel we should _not_ change the groupBy[Key] operation at all,
>>>> and we should only add `repartition()` as a non-optimizable operation.
>>>> 
>>>> Thanks all,
>>>> -John
>>>> 
>>>> On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
>>>> <le...@gmail.com> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> Just fyi, PR was updated and now it incorporates the latest suggestions
>>>> about joins.
>>>>> `CopartitionedTopicsEnforcer` will throw an exception if number of
>>>> partitions aren’t the same when using `repartition` operation along with
>>>> `join`.
>>>>> 
>>>>> For more details please take a look at the PR:
>>>> https://github.com/apache/kafka/pull/7170/files <
>>>> https://github.com/apache/kafka/pull/7170/files>
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>> 
>>>>>> On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io>
>>>> wrote:
>>>>>> 
>>>>>> Thanks a lot for the input Sophie.
>>>>>> 
>>>>>> Your example is quite useful, and I would use it to support my claim
>>>>>> that a "partition hint" for `Grouped` seems "useless" and does not
>>>>>> improve the user experience.
>>>>>> 
>>>>>> 1) You argue that a new user would be worries about repartitions topics
>>>>>> with too many paritions. This would imply that a user is already
>>>>>> advanced enough to understand the implication of repartitioning -- for
>>>>>> this case, I would argue that a user also understand _when_ a
>>>>>> auto-repartitioning would happen and thus the users understands where
>>>> to
>>>>>> insert a `repartition()` operation.
>>>>>> 
>>>>>> 2) For specifying Serdes: if a `groupByKey()` does not trigger
>>>>>> auto-repartitioning it's not required to specify the serdes and if they
>>>>>> are specified they would be ignored/unused (note, that `groupBy()`
>>>> would
>>>>>> always trigger a repartitioning). Of course, if the default Serdes from
>>>>>> the config match (eg, all data types are Json anyway), a user does not
>>>>>> need to worry about specifying serdes. -- For new user that play
>>>> around,
>>>>>> I would assume that they work a lot with primitive types and thus would
>>>>>> need to specify the serdes -- hence, they would learn about
>>>>>> auto-repartitioning the hard way anyhow, because each time a
>>>>>> `groupByKey()` does trigger auto-repartioning, they would need to pass
>>>>>> in the correct Serdes -- this way, they would also be educated where to
>>>>>> insert a `repartition()` operator if needed.
>>>>>> 
>>>>>> 3) If a new user really just "plays around", I don't think they use an
>>>>>> input topic with 100 partitions but most likely have a local single
>>>> node
>>>>>> broker with most likely single partitions topics.
>>>>>> 
>>>>>> 
>>>>>> My main argument for my current proposal is however, that---based on
>>>>>> past experience---it's better to roll out a new feature more carefully
>>>>>> and see how it goes. Last, as John pointed out, we can still extend the
>>>>>> feature in the future. Instead of making a judgment call up-front,
>>>> being
>>>>>> more conservative and less fancy, and revisit the design based on
>>>>>> actuall user feedback after the first version is rolled out, seems to
>>>> be
>>>>>> the better option. Undoing a feature is must harder than extending it.
>>>>>> 
>>>>>> 
>>>>>> While I advocate strong for a simple first version of this feature,
>>>> it's
>>>>>> a community decission in the end, and I would not block this KIP if
>>>>>> there is a broad preference to add `Grouped#withNumberOfPartitions()`
>>>>>> either.
>>>>>> 
>>>>>> 
>>>>>> -Matthias
>>>>>> 
>>>>>> On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
>>>>>>> It seems like we all agree at this point (please correct me if
>>>> wrong!) that
>>>>>>> we should NOT change
>>>>>>> the existing repartitioning behavior, ie we should allow Streams to
>>>>>>> continue to determine when and
>>>>>>> where to repartition -- *unless* explicitly informed to by the use of
>>>> a
>>>>>>> .through or the new .repartition operator.
>>>>>>> 
>>>>>>> Regarding groupBy, the existing behavior we should not disrupt is
>>>>>>> a) repartition *only* when required due to upstream key-changing
>>>> operation
>>>>>>> (ie don't force repartitioning
>>>>>>> based on the presence of an optional config parameter), and
>>>>>>> b) allow optimization of required repartitions, if any
>>>>>>> 
>>>>>>> Within the constraint of not breaking the existing behavior, this
>>>> still
>>>>>>> leaves open the question of whether we
>>>>>>> want to improve the user experience by allowing to provide groupBy
>>>> with a
>>>>>>> *suggestion* for numPartitions (or to
>>>>>>> put it more fairly, whether that *will* improve the experience). I
>>>> agree
>>>>>>> with many of the arguments outlined above but
>>>>>>> let me just push back on this one issue one final time, and if we
>>>> can't
>>>>>>> come to a consensus then I am happy to drop
>>>>>>> it for now so that the KIP can proceed.
>>>>>>> 
>>>>>>> Specifically, my proposal would be to simply augment Grouped with an
>>>>>>> optional numPartitions, understood to
>>>>>>> indicate the user's desired number of partitions *if Streams decides
>>>> to
>>>>>>> repartition due to that groupBy*
>>>>>>> 
>>>>>>>> if a user cares about the number of partition, the user wants to
>>>> enforce
>>>>>>> a repartitioning
>>>>>>> First, I think we should take a step back and examine this claim. I
>>>> agree
>>>>>>> 100% that *if this is true,*
>>>>>>> *then we should not give groupBy an optional numPartitions.* As far
>>>> as I
>>>>>>> see it, there's no argument
>>>>>>> to be had there if we *presuppose that claim.* But I'm not convinced
>>>> in
>>>>>>> that as an axiom of the user
>>>>>>> experience and think we should be examining that claim itself, not the
>>>>>>> consequences of it.
>>>>>>> 
>>>>>>> To give a simple example, let's say some new user is trying out
>>>> Streams and
>>>>>>> wants to just play around
>>>>>>> with it to see if it might be worth looking into. They want to just
>>>> write
>>>>>>> up a simple app and test it out on the
>>>>>>> data in some existing topics they have with a large number of
>>>> partitions,
>>>>>>> and a lot of data. They're just messing
>>>>>>> around, trying new topologies and don't want to go through each new
>>>> one
>>>>>>> step by step to determine if (or where)
>>>>>>> a repartition might be required. They also don't want to force a
>>>>>>> repartition if it turns out to not be required, so they'd
>>>>>>> like to avoid the nice new .repartition operator they saw. But given
>>>> the
>>>>>>> huge number of input partitions, they'd like
>>>>>>> to rest assured that if a repartition does end up being required
>>>> somewhere
>>>>>>> during dev, it will not be created with
>>>>>>> the same huge number of partitions that their input topic has -- so
>>>> they
>>>>>>> just pass groupBy a small numPartitions
>>>>>>> suggestion.
>>>>>>> 
>>>>>>> I know that's a bit of a contrived example but I think it does
>>>> highlight
>>>>>>> how and when this might be a considerable
>>>>>>> quality of life improvement, in particular for new users to Streams
>>>> and/or
>>>>>>> during the dev cycle. *You don't want to*
>>>>>>> *force a repartition if it wasn't necessary, but you don't want to
>>>> create a
>>>>>>> topic with a huge partition count either.*
>>>>>>> 
>>>>>>> Also, while the optimization discussion took us down an interesting
>>>> but
>>>>>>> ultimately more distracting road, it's worth
>>>>>>> pointing out that it is clearly a major win to have as few
>>>>>>> repartition topics/steps as possible. Given that we
>>>>>>> don't want to change existing behavior, the optimization framework
>>>> can only
>>>>>>> help out when the placement of
>>>>>>> repartition steps is flexible, which means only those from .groupBy
>>>> (and
>>>>>>> not .repartition). *Users should not*
>>>>>>> *have to choose between allowing Streams to optimize the repartition
>>>>>>> placement, and allowing to specify a *
>>>>>>> *number of partitions.*
>>>>>>> 
>>>>>>> Lastly, I have what may be a stupid question but for my own
>>>> edification of
>>>>>>> how groupBy works:
>>>>>>> if you do a .groupBy and a repartition is NOT required, does it ever
>>>> need
>>>>>>> to serialize/deserialize
>>>>>>> any of the data? In other words, if you pass a key/value serde to
>>>> groupBy
>>>>>>> and it doesn't trigger
>>>>>>> a repartition, is the serde(s) just ignored and thus more like a
>>>> suggestion
>>>>>>> than a requirement?
>>>>>>> 
>>>>>>> So again, I don't want to hold up this KIP forever but I feel we've
>>>> spent
>>>>>>> some time getting slightly
>>>>>>> off track (although certainly into very interesting discussions) yet
>>>> never
>>>>>>> really addressed or questioned
>>>>>>> the basic premise: *could a user want to specify a number of
>>>> partitions but
>>>>>>> not enforce a repartition (at that*
>>>>>>> *specific point in the topology)?*
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <
>>>> matthias@confluent.io>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Side remark:
>>>>>>>> 
>>>>>>>> If the user specifies `repartition()` on both side of the join, we
>>>> can
>>>>>>>> actually throw the execption earlier, ie, when we build the topology.
>>>>>>>> 
>>>>>>>> Current, we can do this check only after Kafka Streams was started,
>>>>>>>> within `StreamPartitionAssignor#assign()` -- we still need to keep
>>>> this
>>>>>>>> check for the case that none or only one side has a user specified
>>>>>>>> number of partitions though.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Matthias
>>>>>>>> 
>>>>>>>> On 11/14/19 8:15 AM, John Roesler wrote:
>>>>>>>>> Thanks, all,
>>>>>>>>> 
>>>>>>>>> I can get behind just totally leaving out reparation-via-groupBy. If
>>>>>>>>> we only introduce `repartition()` for now, we're making the minimal
>>>>>>>>> change to gain the desired capability.
>>>>>>>>> 
>>>>>>>>> Plus, since we agree that `repartition()` should never be
>>>> optimizable,
>>>>>>>>> it's a future-compatible proposal. I.e., if we were to add a
>>>>>>>>> non-optimizable groupBy(partitions) operation now, and want to make
>>>> it
>>>>>>>>> optimizable in the future, we have to worry about topology
>>>>>>>>> compatibility. Better to just do non-optimizable `repartition()`
>>>> now,
>>>>>>>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
>>>>>>>>> 
>>>>>>>>> About joins, yes, it's a concern, and IMO we should just do the same
>>>>>>>>> thing we do now... check at runtime that the partition counts on
>>>> both
>>>>>>>>> sides match and throw an exception otherwise. What this means as a
>>>>>>>>> user is that if you explicitly repartition the left side to 100
>>>>>>>>> partitions, and then join with the right side at 10 partitions, you
>>>>>>>>> get an exception, since this operation is not possible. You'd either
>>>>>>>>> have to "step down" the left side again, back to 10 partitions, or
>>>> you
>>>>>>>>> could repartition the right side to 100 partitions before the join.
>>>>>>>>> The choice has to be the user's, since it depends on their desired
>>>>>>>>> execution parallelism.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> -John
>>>>>>>>> 
>>>>>>>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <
>>>> matthias@confluent.io>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot John. I think the way you decompose the operators is
>>>> super
>>>>>>>>>> helpful for this discussion.
>>>>>>>>>> 
>>>>>>>>>> What you suggest with regard to using `Grouped` and enforcing
>>>>>>>>>> repartitioning if the number of partitions is specified is
>>>> certainly
>>>>>>>>>> possible. However, I am not sure if we _should_ do this. My
>>>> reasoning is
>>>>>>>>>> that an enforce repartitioning as introduced via `repartition()`
>>>> is an
>>>>>>>>>> expensive operations, and it seems better to demand an more
>>>> explicit
>>>>>>>>>> user opt-in to trigger it. Just setting an optional parameter
>>>> might be
>>>>>>>>>> too subtle to trigger such a heavy "side effect".
>>>>>>>>>> 
>>>>>>>>>> While I agree about "usability" in general, I would prefer a more
>>>>>>>>>> conservative appraoch to introduce this feature, see how it goes,
>>>> and
>>>>>>>>>> maybe make it more advance later on. This also applies to what
>>>>>>>>>> optimzation we may or may not allow (or are able to perform at
>>>> all).
>>>>>>>>>> 
>>>>>>>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
>>>>>>>>>> Grouped`, I agree that it might not be a good idea.
>>>>>>>>>> 
>>>>>>>>>> Atm, I see an enforces repartitioning as non-optimizable and as a
>>>> good
>>>>>>>>>> first step and I would suggest to not intoruce anything else for
>>>> now.
>>>>>>>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
>>>>>>>>>> Grouped)` is something we could add later.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Therefore, I would not change `Grouped` but only introduce
>>>>>>>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in
>>>> to
>>>>>>>>>> set the number of partitions, would need to rewrite their code to
>>>>>>>>>> `selectKey(...).repartition(...).groupByKey()`. It's less
>>>> convinient but
>>>>>>>>>> also less risky from an API and optimization point of view.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> @Levani: about joins -> yes, we will need to check the specified
>>>> number
>>>>>>>>>> of partitions (if any) and if they don't match, throw an
>>>> exception. We
>>>>>>>>>> can discuss this on the PR -- I am just trying to get the PR for
>>>> KIP-466
>>>>>>>>>> merged -- your is next on the list :)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thoughts?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -Matthias
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>>>>>>>>>> Thank you all for an interesting discussion. This is very
>>>> enlightening.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you Matthias for your explanation. Your arguments are very
>>>> true.
>>>>>>>> It makes sense that if user specifies number of partitions he/she
>>>> really
>>>>>>>> cares that those specifications are applied to internal topics.
>>>>>>>>>>> Unfortunately, in current implementation this is not true during
>>>>>>>> `join` operation. As I’ve written in the PR comment, currently, when
>>>>>>>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max
>>>> number of
>>>>>>>> partitions from the two source topics.
>>>>>>>>>>> I’m not really sure what would be the other way around this
>>>> situation.
>>>>>>>> Maybe fail the stream altogether and inform the user to specify same
>>>> number
>>>>>>>> of partitions?
>>>>>>>>>>> Or we should treat join operations in a same way as it is right
>>>> now
>>>>>>>> and basically choose max number of partitions even when `repartition`
>>>>>>>> operation is specified, because Kafka Streams “knows the best” how to
>>>>>>>> handle joins?
>>>>>>>>>>> You can check integration tests how it’s being handled currently.
>>>> Open
>>>>>>>> to suggestions on that part.
>>>>>>>>>>> 
>>>>>>>>>>> As for groupBy, I agree and John raised very interesting points.
>>>> My
>>>>>>>> arguments for allowing users to specify number of partitions during
>>>> groupBy
>>>>>>>> operations mainly was coming from the usability perspective.
>>>>>>>>>>> So building on top of what John said, maybe it makes sense to make
>>>>>>>> `groupBy` operations smarter and whenever user specifies
>>>>>>>> `numberOfPartitions` configuration, repartitioning will be enforced,
>>>> wdyt?
>>>>>>>>>>> I’m not going into optimization part yet :) I think it will be
>>>> part of
>>>>>>>> separate PR and task, but overall it makes sense to apply
>>>> optimizations
>>>>>>>> where number of partitions are the same.
>>>>>>>>>>> 
>>>>>>>>>>> As for Repartitioned extending Grouped, I kinda feel that it
>>>> won’t fit
>>>>>>>> nicely in current API design.
>>>>>>>>>>> In addition, in the PR review, John mentioned that there were a
>>>> lot of
>>>>>>>> troubles in the past trying to use one operation's configuration
>>>> objects on
>>>>>>>> other operations.
>>>>>>>>>>> Also it makes sense to keep them separate in terms of
>>>> compatibility.
>>>>>>>>>>> In that case, we don’t have to worry every time Grouped is
>>>> changed,
>>>>>>>> what would be the implications on `repartition` operations.
>>>>>>>>>>> 
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io>
>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Ah, thanks for the clarification. I missed your point.
>>>>>>>>>>>> 
>>>>>>>>>>>> I like the framework you've presented. It does seem simpler to
>>>> assume
>>>>>>>>>>>> that they either care about the partition count and want to
>>>>>>>>>>>> repartition to realize it, or they don't care about the number.
>>>>>>>>>>>> Returning to this discussion, it does seem unlikely that they
>>>> care
>>>>>>>>>>>> about the number and _don't_ care if it actually gets realized.
>>>>>>>>>>>> 
>>>>>>>>>>>> But then, it still seems like we can just keep the option as
>>>> part of
>>>>>>>>>>>> Grouped. As in:
>>>>>>>>>>>> 
>>>>>>>>>>>> // user does not care
>>>>>>>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>>>>>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>>>>>>>>> 
>>>>>>>>>>>> // user does care
>>>>>>>>>>>> stream.repartition(Repartitioned)
>>>>>>>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>>>>>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>>>>>>>>> 
>>>>>>>>>>>> ----
>>>>>>>>>>>> 
>>>>>>>>>>>> The above discussion got me thinking about algebra. Matthias is
>>>>>>>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent
>>>> to
>>>>>>>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not
>>>> convinced that
>>>>>>>>>>>> we should force people to apply that expansion themselves vs.
>>>> having a
>>>>>>>>>>>> more compact way to express it if they don't care where exactly
>>>> the
>>>>>>>>>>>> repartition occurs. However, thinking about these operators
>>>>>>>>>>>> algebraically can really help *us* narrow down the number of
>>>> different
>>>>>>>>>>>> expressions we have to consider.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let's consider some identities:
>>>>>>>>>>>> 
>>>>>>>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>>>>>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>>>>>>>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>>>>>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>>>>>>>>>> D: repartition = sink(managed) + src
>>>>>>>>>>>> 
>>>>>>>>>>>> In these identities, I used one special identifier (...), which
>>>> means
>>>>>>>>>>>> any number (0+) of operations that are not src, mapKey,
>>>> groupBy[Key],
>>>>>>>>>>>> repartition, or agg.
>>>>>>>>>>>> 
>>>>>>>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>>>>>>>>>> operations are not executable. In other words, you have to get
>>>> to a
>>>>>>>>>>>> point where you can apply B to convert a groupByKey into a
>>>> passthough
>>>>>>>>>>>> in order to execute the program. This is just a formal way of
>>>> stating
>>>>>>>>>>>> what already happens in Kafka Streams.
>>>>>>>>>>>> 
>>>>>>>>>>>> By applying A, we can just completely leave `groupBy` out of our
>>>>>>>>>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>>>>>>>>>> groupByKey.
>>>>>>>>>>>> 
>>>>>>>>>>>> Then, we can eliminate the "repartition required" case of
>>>> `groupByKey`
>>>>>>>>>>>> by applying C followed by D to get to the "no repartition
>>>> required"
>>>>>>>>>>>> version of groupByKey, which in turn sets us up to apply B to
>>>> get an
>>>>>>>>>>>> executable topology.
>>>>>>>>>>>> 
>>>>>>>>>>>> Fundamentally, you can think about KIP-221 is as proposing a
>>>> modified
>>>>>>>>>>>> D identity in which you can specify the partition count of the
>>>> managed
>>>>>>>>>>>> sink topic:
>>>>>>>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>>>>>>>>> 
>>>>>>>>>>>> Since users _could_ apply the identities above, we don't
>>>> actually have
>>>>>>>>>>>> to add any partition count to groupBy[Key], but we decided early
>>>> on in
>>>>>>>>>>>> the KIP discussion that it's more ergonomic to add it. In that
>>>> case,
>>>>>>>>>>>> we also have to modify A and C:
>>>>>>>>>>>> A': groupBy(mapper, pc) + agg
>>>>>>>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
>>>>>>>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>>>>>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>>>>>>>>> 
>>>>>>>>>>>> Which sets us up still to always be able to get back to a plain
>>>>>>>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>>>>>>>>>> ultimately B to get an executable topology.
>>>>>>>>>>>> 
>>>>>>>>>>>> What about the optimizer?
>>>>>>>>>>>> The optimizer applies another set of graph-algebraic identities
>>>> to
>>>>>>>>>>>> minimize the number of repartition topics in a topology.
>>>>>>>>>>>> 
>>>>>>>>>>>> (forgive my ascii art)
>>>>>>>>>>>> 
>>>>>>>>>>>> E: (merging repartition nodes)
>>>>>>>>>>>> (...) -> repartition -> X
>>>>>>>>>>>> \-> repartition -> Y
>>>>>>>>>>>> =
>>>>>>>>>>>> (... + repartition) -> X
>>>>>>>>>>>>   \-> Y
>>>>>>>>>>>> F: (reordering around repartition)
>>>>>>>>>>>> Where SVO is any non-key-changing, stateless, operation:
>>>>>>>>>>>> repartition -> SVO = SVO -> repartition
>>>>>>>>>>>> 
>>>>>>>>>>>> In terms of these identities, what the optimizer does is apply F
>>>>>>>>>>>> repeatedly in either direction to a topology to factor out
>>>> common in
>>>>>>>>>>>> branches so that it can apply E to merge repartition nodes. This
>>>> was
>>>>>>>>>>>> especially necessary before KIP-221 because you couldn't directly
>>>>>>>>>>>> express `repartition` in the DSL, only indirectly via
>>>> `groupBy[Key]`,
>>>>>>>>>>>> so there was no way to do the factoring manually.
>>>>>>>>>>>> 
>>>>>>>>>>>> We can now state very clearly that in KIP-221, explicit
>>>>>>>>>>>> `repartition()` operators should create a "reordering barrier".
>>>> So, F
>>>>>>>>>>>> cannot be applied to an explicit `repartition()`. Also, I think
>>>> we
>>>>>>>>>>>> decided earlier that explicit `repartition()` operations would
>>>> also be
>>>>>>>>>>>> ineligible for merging, so E can't be applied to explicit
>>>>>>>>>>>> `repartition()` operations either. I think we feel we _could_
>>>> apply E
>>>>>>>>>>>> without harm, but we want to be conservative for now.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the salient point from the latter discussion has been
>>>> that
>>>>>>>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_
>>>> constitute
>>>>>>>>>>>> an explicit `repartition()` operator, and therefore, the
>>>> resulting
>>>>>>>>>>>> repartition node remains eligible for optimization.
>>>>>>>>>>>> 
>>>>>>>>>>>> To be clear, I agree with Matthias that the provided partition
>>>> count
>>>>>>>>>>>> _must_ be used in the resulting implicit repartition. This has
>>>> some
>>>>>>>>>>>> implications for E. Namely, E could only be applied to two
>>>> repartition
>>>>>>>>>>>> nodes that have the same partition count. This has always been
>>>>>>>>>>>> trivially true before KIP-221 because the partition count has
>>>> always
>>>>>>>>>>>> been "unspecified", i.e., it would be determined at runtime by
>>>> the
>>>>>>>>>>>> user-managed-topics' partition counts. Now, it could be
>>>> specified or
>>>>>>>>>>>> unspecified. We can simply augment E to allow merging only
>>>> repartition
>>>>>>>>>>>> nodes where the partition count is EITHER "specified and the
>>>> same on
>>>>>>>>>>>> both sides", OR "unspecified on both sides".
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry for the long email, but I have a hope that it builds a
>>>> solid
>>>>>>>>>>>> theoretical foundation for our decisions in KIP-221, so we can
>>>> have
>>>>>>>>>>>> confidence that there are no edge cases for design flaws to hide.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
>>>>>>>> matthias@confluent.io> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> it seems like we do want to allow
>>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>>> repartitioning
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Correct, ie, that is my suggestions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I disagree here, because my reasoning is that:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - if a user cares about the number of partition, the user wants
>>>> to
>>>>>>>>>>>>> enforce a repartitioning
>>>>>>>>>>>>> - if a user does not case about the number of partitions, we
>>>> don't
>>>>>>>> need
>>>>>>>>>>>>> to provide them a way to pass in a "hint"
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hence, it should be sufficient to support:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> // user does not care
>>>>>>>>>>>>> 
>>>>>>>>>>>>> `stream.groupByKey(Grouped)`
>>>>>>>>>>>>> `stream.grouBy(..., Grouped)`
>>>>>>>>>>>>> 
>>>>>>>>>>>>> // user does care
>>>>>>>>>>>>> 
>>>>>>>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
>>>>>>>>>>>>> `streams.groupBy(..., Repartitioned)`
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>>>>>>>>>> Thanks for those thoughts, Matthias,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I find your reasoning about the optimization behavior
>>>> compelling.
>>>>>>>> The
>>>>>>>>>>>>>> `through` operation is very simple and clear to reason about.
>>>> It
>>>>>>>> just
>>>>>>>>>>>>>> passes the data exactly at the specified point in the topology
>>>>>>>> exactly
>>>>>>>>>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>>>>>>>>>> `repartition` operator, the simplest behavior is if we just do
>>>> what
>>>>>>>>>>>>>> they asked for.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Stepping back to think about when optimizations are surprising
>>>> and
>>>>>>>>>>>>>> when they aren't, it occurs to me that we should be free to
>>>> move
>>>>>>>>>>>>>> around repartitions when users have asked to perform some
>>>> operation
>>>>>>>>>>>>>> that implies a repartition, like "change keys, then filter,
>>>> then
>>>>>>>>>>>>>> aggregate". This program requires a repartition, but it could
>>>> be
>>>>>>>>>>>>>> anywhere between the key change and the aggregation. On the
>>>> other
>>>>>>>>>>>>>> hand, if they say, "change keys, then filter, then
>>>> repartition, then
>>>>>>>>>>>>>> aggregate", it seems like they were pretty clear about their
>>>> desire,
>>>>>>>>>>>>>> and we should just take it at face value.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So, I'm sold on just literally doing a repartition every time
>>>> they
>>>>>>>>>>>>>> invoke the `repartition` operator.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is
>>>> more
>>>>>>>> nuanced.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What you said about `groupByKey` makes sense. If they key
>>>> hasn't
>>>>>>>>>>>>>> actually changed, then we don't need to repartition before
>>>>>>>>>>>>>> aggregating. On the other hand, `groupBy` is specifically
>>>> changing
>>>>>>>> the
>>>>>>>>>>>>>> key as part of the grouping operation, so (as you said) we
>>>>>>>> definitely
>>>>>>>>>>>>>> have to do a repartition.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If I'm reading the discussion right, it seems like we do want
>>>> to
>>>>>>>> allow
>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>>> repartitioning
>>>>>>>> if
>>>>>>>>>>>>>> it's not needed. That last clause is the key. "Use P
>>>> partitions if
>>>>>>>>>>>>>> repartitioning is necessary" is a directive that applies
>>>> cleanly and
>>>>>>>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call
>>>> the
>>>>>>>>>>>>>> option `numberOfPartitionsHint`, which along with the "if
>>>> necessary"
>>>>>>>>>>>>>> javadoc, should make it clear that the option won't force a
>>>>>>>>>>>>>> repartition, and also gives us enough latitude to still employ
>>>> the
>>>>>>>>>>>>>> optimizer on those repartition topics?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If we like the idea of expressing it as a "hint" for grouping
>>>> and a
>>>>>>>>>>>>>> "command" for `repartition`, then it seems like it still makes
>>>> sense
>>>>>>>>>>>>>> to keep Grouped and Repartitioned separate, as they would
>>>> actually
>>>>>>>>>>>>>> offer different methods with distinct semantics.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
>>>>>>>> matthias@confluent.io> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sorry for late reply.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>>>>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using
>>>> the
>>>>>>>>>>>>>>> upstream topic to determine the number of partitions)
>>>>>>>>>>>>>>> - KS does auto-repartitioning only for downstream DSL
>>>> operators
>>>>>>>> like
>>>>>>>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
>>>>>>>> auto-repartitioning
>>>>>>>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>>>>>>>>>> - KS offers `through()` to enforce a repartitioning --
>>>> however,
>>>>>>>> the user
>>>>>>>>>>>>>>> needs to create the topic manually (with the desired number of
>>>>>>>> partitions).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I see two main applications for `repartitioning()`:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1) repartition data before a `transform()` but user does not
>>>> want
>>>>>>>> to
>>>>>>>>>>>>>>> manage the topic
>>>>>>>>>>>>>>> 2) scale out a downstream subtopology
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a
>>>> users
>>>>>>>> calls
>>>>>>>>>>>>>>> it, a repartitining is enforced, with the difference that KS
>>>>>>>> manages the
>>>>>>>>>>>>>>> topic and the user does not need to create it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>>> repartition
>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Because of (2), I disagree. Either a user does not care about
>>>>>>>> scaling
>>>>>>>>>>>>>>> out, for which case she would not specify the number of
>>>>>>>> partitions. Or a
>>>>>>>>>>>>>>> user does care, and hence wants to enforce the scale out. I
>>>> don't
>>>>>>>> think
>>>>>>>>>>>>>>> that any user would say, "maybe scale out".
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Therefore, the optimizer should never ignore the repartition
>>>>>>>> operation.
>>>>>>>>>>>>>>> As a "consequence" (because repartitioning is expensive) a
>>>> user
>>>>>>>> should
>>>>>>>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking
>>>> an
>>>>>>>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be
>>>> "dangerous"
>>>>>>>>>>>>>>> because it might be too subtle and an "optional scaling out"
>>>> as
>>>>>>>> laid out
>>>>>>>>>>>>>>> above does not make sense IMHO.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am also not worried about "over repartitioning" because the
>>>>>>>> result
>>>>>>>>>>>>>>> stream would never trigger auto-repartitioning. Only if
>>>> multiple
>>>>>>>>>>>>>>> consecutive calls to `repartition()` are made it could be bad
>>>> --
>>>>>>>> but
>>>>>>>>>>>>>>> that's the same with `through()`. In the end, there is always
>>>> some
>>>>>>>>>>>>>>> responsibility on the user.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be
>>>> required,
>>>>>>>>>>>>>>> however, for `groupByKey()` it depends if the KStream is
>>>> marked as
>>>>>>>>>>>>>>> `repartitioningRequired`.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a
>>>> user to
>>>>>>>> set
>>>>>>>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different
>>>> story,
>>>>>>>>>>>>>>> because calling
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> `repartition().groupBy()`
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> does not achieve what we want. Hence, allowing users to pass
>>>> in the
>>>>>>>>>>>>>>> number of users partitions into `groupBy()` does actually
>>>> makes
>>>>>>>> sense,
>>>>>>>>>>>>>>> because repartitioning will happen anyway and thus we can
>>>>>>>> piggyback a
>>>>>>>>>>>>>>> scaling decision.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think that John has a fair concern about the overloads,
>>>> however,
>>>>>>>> I am
>>>>>>>>>>>>>>> not convinced that using `Grouped` to specify the number of
>>>>>>>> partitions
>>>>>>>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned`
>>>> and
>>>>>>>> both
>>>>>>>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I
>>>> am
>>>>>>>>>>>>>>> wondering if we could bridge the gap between both, if we
>>>> would make
>>>>>>>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what
>>>> seems to
>>>>>>>> make
>>>>>>>>>>>>>>> the API quite smooth:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> `stream.groupBy(..., Grouped...)`
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>> Hi Sophie,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
>>>>>>>> hearing others opinion as well on this.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
>>>>>>>> sophie@confluent.io> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>>>>>>> other hand
>>>>>>>>>>>>>>>>> Kafka Streams has already
>>>>>>>>>>>>>>>>>> optimizer in place which alters topology independently
>>>> from user
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I agree (with you) and think this is a good way to put it
>>>> -- we
>>>>>>>> currently
>>>>>>>>>>>>>>>>> auto-repartition for the user so
>>>>>>>>>>>>>>>>> that they don't have to walk through their entire topology
>>>> and
>>>>>>>> reason about
>>>>>>>>>>>>>>>>> when and where to place a
>>>>>>>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly
>>>> force
>>>>>>>> this onto the
>>>>>>>>>>>>>>>>> user? How certain are we that
>>>>>>>>>>>>>>>>> users will always get this right? It's easy to imagine that
>>>>>>>> during
>>>>>>>>>>>>>>>>> development, you write your new app with
>>>>>>>>>>>>>>>>> correctly placed repartitions in order to use this new
>>>> feature.
>>>>>>>> During the
>>>>>>>>>>>>>>>>> course of development you end up
>>>>>>>>>>>>>>>>> tweaking the topology, but don't remember to review or move
>>>> the
>>>>>>>>>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>>>>>>>>>> doing this for you. If you use only single-partition topics
>>>> for
>>>>>>>> testing,
>>>>>>>>>>>>>>>>> you might not even notice your app is
>>>>>>>>>>>>>>>>> spitting out incorrect results!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
>>>>>>>> introduce a new
>>>>>>>>>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>>>>>>>>>> advantage of this other feature anymore. Also, is it
>>>> possible our
>>>>>>>>>>>>>>>>> optimization framework could ever include an
>>>>>>>>>>>>>>>>> optimized repartitioning strategy that is better than what a
>>>>>>>> user could
>>>>>>>>>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>>>>>>>>>> Do we expect users to have a deep understanding of the best
>>>> way
>>>>>>>> to
>>>>>>>>>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>>>>>>>>>> likely they will end up over-repartitioning either due to
>>>> missed
>>>>>>>>>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>>> repartition
>>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped
>>>> rather
>>>>>>>> than
>>>>>>>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in
>>>> line
>>>>>>>> with the
>>>>>>>>>>>>>>>>> current syntax so +1 from me
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under
>>>> review and
>>>>>>>> it’s
>>>>>>>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP
>>>> to
>>>>>>>> address
>>>>>>>>>>>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
>>>>>>>> control over
>>>>>>>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is
>>>> done
>>>>>>>> in
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which
>>>> is
>>>>>>>> planned to
>>>>>>>>>>>>>>>>>> be separate PR.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> While all agree about general implementation and idea
>>>> behind
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR,
>>>> introducing new
>>>>>>>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
>>>>>>>> questions during
>>>>>>>>>>>>>>>>>> the review.
>>>>>>>>>>>>>>>>>> Matthias raised concern that there can be cases when user
>>>> uses
>>>>>>>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
>>>>>>>> repartitioning may
>>>>>>>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
>>>>>>>> would never be
>>>>>>>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted
>>>> your
>>>>>>>> comment).
>>>>>>>>>>>>>>>>>> So instead, if user wants to control parallelism of
>>>>>>>> sub-topologies, he or
>>>>>>>>>>>>>>>>>> she should always use `KStream#repartition` operation
>>>> before
>>>>>>>> groupBy. Full
>>>>>>>>>>>>>>>>>> comment can be seen here:
>>>>>>>>>>>>>>>>>> 
>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On the same topic, John pointed out that, from API design
>>>>>>>> perspective, we
>>>>>>>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
>>>>>>>> operators between
>>>>>>>>>>>>>>>>>> one another. So instead of introducing new
>>>>>>>> `KStream#groupBy(Repartitioned)`
>>>>>>>>>>>>>>>>>> for specifying number of partitions for internal topic, we
>>>>>>>> should update
>>>>>>>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>>>>>>> other hand
>>>>>>>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
>>>>>>>> topology
>>>>>>>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
>>>>>>>> Streams,
>>>>>>>>>>>>>>>>>> internally would optimize topology in the best way
>>>> possible,
>>>>>>>> even if in
>>>>>>>>>>>>>>>>>> some cases this means ignoring some operator configurations
>>>>>>>> passed by the
>>>>>>>>>>>>>>>>>> user. Also, I agree with John about API design semantics.
>>>> If we
>>>>>>>> go through
>>>>>>>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes
>>>> more
>>>>>>>> sense to
>>>>>>>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead
>>>> of
>>>>>>>> introducing
>>>>>>>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
>>>>>>>> sophie@confluent.io>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
>>>>>>>> don't have
>>>>>>>>>>>>>>>>>> much
>>>>>>>>>>>>>>>>>>> spare time at the
>>>>>>>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
>>>>>>>> things, but
>>>>>>>>>>>>>>>>>> feel
>>>>>>>>>>>>>>>>>>> free to pick up something else
>>>>>>>>>>>>>>>>>>> to work on in the meantime until things have calmed down
>>>> a bit!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like
>>>> to get
>>>>>>>> some
>>>>>>>>>>>>>>>>>>>> attention on this PR:
>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>>>>>>>>>>>> It's been a while now and I would love to move on to
>>>> other
>>>>>>>> KIPs as well.
>>>>>>>>>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>>>>>>> 
>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>>>>>>> 
>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>>>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion
>>>> on
>>>>>>>> that one.
>>>>>>>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a
>>>> good
>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>> <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
>>>> matthias@confluent.io>>>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
>>>>>>>> numberOfPartitions)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
>>>>>>>> `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>>>>>>>>>> combination with `withName()` and
>>>>>>>> `withNumberOfPartitions()` should
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
>>>>>>>> small if
>>>>>>>>>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re
>>>> any
>>>>>>>> other
>>>>>>>>>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
>>>> matthias@confluent.io>>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>> <mailto:
>>>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
>>>>>>>> "extension" to
>>>>>>>>>>>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to
>>>> think
>>>>>>>> about it and
>>>>>>>>>>>>>>>>>>>> helps
>>>>>>>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to
>>>> avoid
>>>>>>>>>>>>>>>>>> abbreviations.
>>>>>>>>>>>>>>>>>>>> Can
>>>>>>>>>>>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more
>>>> `static`
>>>>>>>> methods:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>>>>>>>>>> -
>>>> Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it
>>>> makes
>>>>>>>> sense to
>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>> some duplicate configurations between operators in order
>>>> to
>>>>>>>> keep API
>>>>>>>>>>>>>>>>>> simple
>>>>>>>>>>>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to
>>>> have
>>>>>>>> proper
>>>>>>>>>>>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
>>>>>>>> separate was
>>>>>>>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
>>>>>>>> producer level
>>>>>>>>>>>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman
>>>> <
>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>>> <mailto:
>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>>> <mailto:
>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
>>>>>>>> producer level
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
>>>>>>>> considerations
>>>>>>>>>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
>>>>>>>> level
>>>>>>>>>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just
>>>> as
>>>>>>>> much a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have
>>>> two
>>>>>>>> producers
>>>>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
>>>>>>>> they are
>>>>>>>>>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
>>>>>>>> part. So in a
>>>>>>>>>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
>>>>>>>> producers.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams
>>>> API is
>>>>>>>> not always
>>>>>>>>>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
>>>>>>>> operator are
>>>>>>>>>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We
>>>> could
>>>>>>>> instead
>>>>>>>>>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
>>>>>>>> parameter to keep
>>>>>>>>>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
>>>>>>>> increases the
>>>>>>>>>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to
>>>> just
>>>>>>>> say "this
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
>>>>>>>> without worrying
>>>>>>>>>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
>>>>>>>> field of
>>>>>>>>>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
>>>>>>>> elsewhere in
>>>>>>>>>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani
>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>>>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class
>>>> for
>>>>>>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer
>>>> level
>>>>>>>>>>>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in
>>>> this
>>>>>>>> thread).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
>>>>>>>> interface, I
>>>>>>>>>>>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda
>>>> breaking
>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
>>>>>>>> encapsulates topic
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal
>>>> topics is
>>>>>>>> viable
>>>>>>>>>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
>>>>>>>> bbejeck@gmail.com
>>>>>>>>>>>>>>>>>> <ma...@gmail.com>
>>>>>>>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.
>>>> In
>>>>>>>> addition to
>>>>>>>>>>>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
>>>>>>>> hash-based
>>>>>>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
>>>>>>>> different than the
>>>>>>>>>>>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
>>>>>>>> conjunction with
>>>>>>>>>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
>>>>>>>> Produced hence
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes
>>>> sense
>>>>>>>> for API
>>>>>>>>>>>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani
>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
>>>>>>>> Repartitioned class,
>>>>>>>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of
>>>> partitions for
>>>>>>>> internal
>>>>>>>>>>>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure
>>>> co-partitioning
>>>>>>>> before join
>>>>>>>>>>>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind.
>>>> Again,
>>>>>>>> not totally
>>>>>>>>>>>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar
>>>> is
>>>>>>>> the
>>>>>>>>>>>>>>>>>> rationale
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani
>>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your
>>>> example
>>>>>>>> I would
>>>>>>>>>>>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper
>>>> partitioner
>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with
>>>> and
>>>>>>>> then do the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler
>>>> <
>>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>> john@confluent.io
>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io
>>>> <mailto:
>>>>>>>>>>>>>>>>>> john@confluent.io>
>>>>>>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an
>>>> option to
>>>>>>>> specify
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is
>>>> that
>>>>>>>> some data may
>>>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
>>>>>>>> topic. If the
>>>>>>>>>>>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
>>>>>>>> strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
>>>>>>>> partitioned with the
>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
>>>>>>>> something
>>>>>>>>>>>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
>>>>>>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To
>>>> be
>>>>>>>> honest I’m
>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
>>>>>>>> think I would
>>>>>>>>>>>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka
>>>> Streams
>>>>>>>> knows best
>>>>>>>>>>>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should
>>>> be
>>>>>>>> used only for
>>>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet,
>>>> but I
>>>>>>>> guess
>>>>>>>>>>>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
>>>>>>>> agreement on
>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
>>>>>>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
>>>> sophie@confluent.io>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
>>>>>>>> Repartitioned
>>>>>>>>>>>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also
>>>> need
>>>>>>>> to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field,
>>>> similar
>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but
>>>> seems
>>>>>>>> it should
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by
>>>> Kafka
>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user
>>>> more
>>>>>>>> control over
>>>>>>>>>>>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
>>>>>>>> something that
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
>>>>>>>> parallelism by
>>>>>>>>>>>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
>>>>>>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
>>>> sophie@confluent.io>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
>>>>>>>> thing for me
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems
>>>> to be
>>>>>>>> the case
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it
>>>> have to
>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version
>>>> of
>>>>>>>> the method
>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>>> levani.codes@gmail.com>
>>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing
>>>> list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
>>>>>>>> opinions/concerns
>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward,
>>>> I’ve
>>>>>>>> updated
>>>>>>>>>>>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>> 
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected
>>>> Alternatives”
>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP
>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
>>>>>>>> dedicated `Topic`
>>>>>>>>>>>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators
>>>> like
>>>>>>>>>>>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others
>>>> opinion
>>>>>>>> about this
>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM,
>>>> Matthias
>>>>>>>> J. Sax <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP!
>>>> And
>>>>>>>> thanks for
>>>>>>>>>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
>>>>>>>> discussed
>>>>>>>>>>>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a
>>>> good
>>>>>>>> summary to
>>>>>>>>>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes
>>>> sense.
>>>>>>>> With regard
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage
>>>> topics
>>>>>>>> and
>>>>>>>>>>>>>>>>>> operators
>>>>>>>>>>>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument,
>>>> it
>>>>>>>> might
>>>>>>>>>>>>>>>>>> indicate
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator
>>>> that
>>>>>>>> uses
>>>>>>>>>>>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
>>>>>>>> operator (an
>>>>>>>>>>>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
>>>>>>>> Otherwise, there is
>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
>>>>>>>> topics but
>>>>>>>>>>>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make
>>>> the
>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
>>>>>>>> `Grouped`. I am
>>>>>>>>>>>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does
>>>> apply
>>>>>>>> and adding
>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
>>>>>>>> opinions of
>>>>>>>>>>>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I
>>>> prefer
>>>>>>>> atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration
>>>> objects
>>>>>>>> that allow
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer,
>>>> producer,
>>>>>>>> store. If we
>>>>>>>>>>>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if
>>>> we
>>>>>>>> should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()`
>>>> class
>>>>>>>> and method
>>>>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to
>>>> add
>>>>>>>> other
>>>>>>>>>>>>>>>>>> configs,
>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc,
>>>> easily,
>>>>>>>> without the
>>>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not
>>>> sure
>>>>>>>> if I like
>>>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani
>>>> Kokhreidze
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
>>>>>>>> maybe
>>>>>>>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
>>>>>>>> best approach.
>>>>>>>>>>>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class
>>>> with
>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation.
>>>> Since
>>>>>>>> KStream#to is
>>>>>>>>>>>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be
>>>> reasonable
>>>>>>>>>>>>>>>>>> assumption
>>>>>>>>>>>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in
>>>> advance. And
>>>>>>>> in that
>>>>>>>>>>>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make
>>>> much
>>>>>>>> sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class,
>>>> based
>>>>>>>> on API
>>>>>>>>>>>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something
>>>> that
>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer,
>>>> partitioner
>>>>>>>> those all
>>>>>>>>>>>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions
>>>> is
>>>>>>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer
>>>> level
>>>>>>>>>>>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
>>>>>>>> seems like
>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with
>>>> non-internal
>>>>>>>> (e.g
>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics
>>>> and I
>>>>>>>> think we
>>>>>>>>>>>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into
>>>> account,
>>>>>>>> I think we
>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
>>>>>>>> Streams should
>>>>>>>>>>>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
>>>>>>>> topics that
>>>>>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
>>>>>>>> numPartitions
>>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and
>>>> producer
>>>>>>>> level
>>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name
>>>> optional
>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and
>>>> giving
>>>>>>>> users
>>>>>>>>>>>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more
>>>> useful :)
>>>>>>>>>>>>>>>>>> Considering
>>>>>>>>>>>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
>>>>>>>> class, I see two
>>>>>>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>>>>>>> auto-generated and
>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>> numOfPartitions)
>>>>>>>> - topic
>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>> numOfPartitions,
>>>>>>>> final
>>>>>>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
>>>> generated
>>>>>>>> with
>>>>>>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>>>>>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it
>>>> is and
>>>>>>>> introduce
>>>>>>>>>>>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
>>>>>>>> suggested this
>>>>>>>>>>>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
>>>>>>>> propose the
>>>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>>>>>>> configuration to
>>>>>>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method
>>>> overloads to
>>>>>>>>>>>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>>>>>>> auto-generated and
>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>> numOfPartitions)
>>>>>>>> - topic
>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>>> numOfPartitions,
>>>>>>>> final
>>>>>>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
>>>> generated
>>>>>>>> with
>>>>>>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>>>>>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>>>>>>> configuration to
>>>>>>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
>>>>>>>> KStream#repartition for
>>>>>>>>>>>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
>>>>>>>> already
>>>>>>>>>>>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the
>>>> threads
>>>>>>>> that were
>>>>>>>>>>>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM,
>>>> Levani
>>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect
>>>> discussion
>>>>>>>> around
>>>>>>>>>>>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems
>>>> to
>>>>>>>> agreement
>>>>>>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation,
>>>> as far
>>>>>>>> as I
>>>>>>>>>>>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
>>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
>>>>>>>> numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
>>>>>>>> users to
>>>>>>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing
>>>> stateful
>>>>>>>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
>>>>>>>> changes around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Levani,

do you agree to the current proposal? It's basically a de-scoping of the
already voted KIP. If you agree, could you update the KIP wiki page
accordingly, including the "Rejected Alternative" section (and mabye a
link to a follow up Jira ticket).

Because it's a descope, and John and myself support it, there seems to
be no need to re-vote.

@Sophie,John: thanks a lot for your thoughtful input!


-Matthias

On 11/15/19 12:47 PM, John Roesler wrote:
> Thanks Sophie,
> 
> I think your concern is valid, and also that your idea to make a
> ticket is a good idea.
> 
> Creating a ticket has some very positive effects:
> * It allows us to record the thinking at this point in time so we
> don't have to dig through the mail archives later
> * It demonstrates that we did consider the use case, and do want to
> address it, but just don't feel confident to implement it right now.
> Then, if/when people do have a problem with the gap, the ticket it
> already there for them to consider, request, or even pick up.
> 
> Since one aspect of the deferral is a desire to wait for real use
> experience, we should explicitly mention that in the ticket. This is
> just good information for people browsing the Jira looking for
> interesting tickets to pick up. They could still pick it up, but they
> can ask themselves if they really understand the real-world use cases
> any better than we do right now.
> 
> Thanks, likewise, to you for the good discussion!
> -John
> 
> On Fri, Nov 15, 2019 at 2:37 PM Sophie Blee-Goldman <so...@confluent.io> wrote:
>>
>> While I'm concerned that "not augmenting groupBy as part of this KIP"
>> really translates to "will not get around to augmenting groupBy for a long
>> time if not as part of this KIP", like I said I don't want to hold up the
>> new
>> .repartition operator that it seems we do, at least, all agree on. It's a
>> fair
>> point that we can always add this in later, but undoing it is far more
>> problematic.
>>
>> Anyways, I would be happy if we at least make a ticket to consider adding a
>> "number of partitions" option/suggestion to groupBy, so that we don't lose
>> all the thought put in to this decision so far and can avoid rehashing the
>> same
>> argument word for word and have something to point to when someone
>> asks "why didn't we add this numPartitions option to groupBy".
>>
>> Beyond that, if the community isn't pushing for it at this moment then it
>> seems very
>> reasonable to shelve the idea for now so that the rest of this KIP can
>> proceed.
>> Without input one way or another it's hard to say what the right thing to
>> do is,
>> which makes the right thing to do "wait to add this feature"
>>
>> Thanks for the good discussion everyone,
>>
>> Sophie
>>
>> On Fri, Nov 15, 2019 at 12:41 PM John Roesler <jo...@confluent.io> wrote:
>>
>>> Hi all,
>>>
>>> I think that Sophie is asking a good question, and I do think that
>>> such "blanket configurations" are plausible. For example, we currently
>>> support (and I would encourage) "I don't know if this is going to
>>> create a repartition topic, but if it does, then use this name instead
>>> of generating one".
>>>
>>> I'm not sure I'm convinced that specifying max parallelism falls into
>>> this category. After all, the groupByKey+aggregate will be executed
>>> with _some_ max parallelism. It's either the same as the inputs'
>>> partition count or overridden with the proposed config. It seems
>>> counterintuitive to override the specified option with the default
>>> value.
>>>
>>> I'm not sure if I can put my finger on it, but "maybe use this name"
>>> seems way more reasonable to me than "maybe execute with this degree
>>> of parallelism".
>>>
>>> I do think (and I appreciate that this is where Sophie's example is
>>> coming from) that Streams should strive to be absolutely as simple and
>>> intuitive as possible (while still maintaining correctness). Optimal
>>> performance can be at odds with API simplicity. For example, the
>>> simplest behavior is, if you ask for 5 partitions, you get 5
>>> partitions. Maybe a repartition is technically not necessary (if you
>>> didn't change the key), but at least there's no mystery to this
>>> behavior.
>>>
>>> Clearly, an (opposing) tenent of simplicity is trying to prevent
>>> people from making mistakes, which I think is what the example boils
>>> down to. Sometimes, we can prevent clear mistakes, like equi-joining
>>> two topics with different partition counts. But for this case, it
>>> doesn't seem as clear-cut to be able to assume that they _said_ 5
>>> partitions, but they didn't really _want_ 5 partitions. Maybe we can
>>> just try to be clear in the documentation, and also even log a warning
>>> when we parse the topology, "hey, I've been asked to repartition this
>>> stream, but it's not necessary".
>>>
>>> If anything, this discussion really supports to me the value in just
>>> sticking with `repartition()` for now, and deferring
>>> `groupBy[Key](partitions)` to the future.
>>>
>>>> Users should not have to choose between allowing Streams to optimize the
>>> repartition placement, and allowing to specify a number of partitions.
>>>
>>> This is a very fair point, and it may be something that we rapidly
>>> return to, but it seems safe for now to introduce the non-optimizable
>>> `reparition()` only, and then consider optimization options later.
>>> Skipping available optimizations will never break correctness, but
>>> adding optimizations can, so it makes sense to treat them with
>>> caution.
>>>
>>> In conclusion, I do think that a use _could_ want to "maybe specify"
>>> the partition count, but I also think we can afford to pass on
>>> supporting this right now.
>>>
>>> I'm open to continuing the discussion, but just to avoid ambiguity, I
>>> still feel we should _not_ change the groupBy[Key] operation at all,
>>> and we should only add `repartition()` as a non-optimizable operation.
>>>
>>> Thanks all,
>>> -John
>>>
>>> On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
>>> <le...@gmail.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Just fyi, PR was updated and now it incorporates the latest suggestions
>>> about joins.
>>>> `CopartitionedTopicsEnforcer` will throw an exception if number of
>>> partitions aren’t the same when using `repartition` operation along with
>>> `join`.
>>>>
>>>> For more details please take a look at the PR:
>>> https://github.com/apache/kafka/pull/7170/files <
>>> https://github.com/apache/kafka/pull/7170/files>
>>>>
>>>> Regards,
>>>> Levani
>>>>
>>>>
>>>>> On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>>>
>>>>> Thanks a lot for the input Sophie.
>>>>>
>>>>> Your example is quite useful, and I would use it to support my claim
>>>>> that a "partition hint" for `Grouped` seems "useless" and does not
>>>>> improve the user experience.
>>>>>
>>>>> 1) You argue that a new user would be worries about repartitions topics
>>>>> with too many paritions. This would imply that a user is already
>>>>> advanced enough to understand the implication of repartitioning -- for
>>>>> this case, I would argue that a user also understand _when_ a
>>>>> auto-repartitioning would happen and thus the users understands where
>>> to
>>>>> insert a `repartition()` operation.
>>>>>
>>>>> 2) For specifying Serdes: if a `groupByKey()` does not trigger
>>>>> auto-repartitioning it's not required to specify the serdes and if they
>>>>> are specified they would be ignored/unused (note, that `groupBy()`
>>> would
>>>>> always trigger a repartitioning). Of course, if the default Serdes from
>>>>> the config match (eg, all data types are Json anyway), a user does not
>>>>> need to worry about specifying serdes. -- For new user that play
>>> around,
>>>>> I would assume that they work a lot with primitive types and thus would
>>>>> need to specify the serdes -- hence, they would learn about
>>>>> auto-repartitioning the hard way anyhow, because each time a
>>>>> `groupByKey()` does trigger auto-repartioning, they would need to pass
>>>>> in the correct Serdes -- this way, they would also be educated where to
>>>>> insert a `repartition()` operator if needed.
>>>>>
>>>>> 3) If a new user really just "plays around", I don't think they use an
>>>>> input topic with 100 partitions but most likely have a local single
>>> node
>>>>> broker with most likely single partitions topics.
>>>>>
>>>>>
>>>>> My main argument for my current proposal is however, that---based on
>>>>> past experience---it's better to roll out a new feature more carefully
>>>>> and see how it goes. Last, as John pointed out, we can still extend the
>>>>> feature in the future. Instead of making a judgment call up-front,
>>> being
>>>>> more conservative and less fancy, and revisit the design based on
>>>>> actuall user feedback after the first version is rolled out, seems to
>>> be
>>>>> the better option. Undoing a feature is must harder than extending it.
>>>>>
>>>>>
>>>>> While I advocate strong for a simple first version of this feature,
>>> it's
>>>>> a community decission in the end, and I would not block this KIP if
>>>>> there is a broad preference to add `Grouped#withNumberOfPartitions()`
>>>>> either.
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
>>>>>> It seems like we all agree at this point (please correct me if
>>> wrong!) that
>>>>>> we should NOT change
>>>>>> the existing repartitioning behavior, ie we should allow Streams to
>>>>>> continue to determine when and
>>>>>> where to repartition -- *unless* explicitly informed to by the use of
>>> a
>>>>>> .through or the new .repartition operator.
>>>>>>
>>>>>> Regarding groupBy, the existing behavior we should not disrupt is
>>>>>> a) repartition *only* when required due to upstream key-changing
>>> operation
>>>>>> (ie don't force repartitioning
>>>>>> based on the presence of an optional config parameter), and
>>>>>> b) allow optimization of required repartitions, if any
>>>>>>
>>>>>> Within the constraint of not breaking the existing behavior, this
>>> still
>>>>>> leaves open the question of whether we
>>>>>> want to improve the user experience by allowing to provide groupBy
>>> with a
>>>>>> *suggestion* for numPartitions (or to
>>>>>> put it more fairly, whether that *will* improve the experience). I
>>> agree
>>>>>> with many of the arguments outlined above but
>>>>>> let me just push back on this one issue one final time, and if we
>>> can't
>>>>>> come to a consensus then I am happy to drop
>>>>>> it for now so that the KIP can proceed.
>>>>>>
>>>>>> Specifically, my proposal would be to simply augment Grouped with an
>>>>>> optional numPartitions, understood to
>>>>>> indicate the user's desired number of partitions *if Streams decides
>>> to
>>>>>> repartition due to that groupBy*
>>>>>>
>>>>>>> if a user cares about the number of partition, the user wants to
>>> enforce
>>>>>> a repartitioning
>>>>>> First, I think we should take a step back and examine this claim. I
>>> agree
>>>>>> 100% that *if this is true,*
>>>>>> *then we should not give groupBy an optional numPartitions.* As far
>>> as I
>>>>>> see it, there's no argument
>>>>>> to be had there if we *presuppose that claim.* But I'm not convinced
>>> in
>>>>>> that as an axiom of the user
>>>>>> experience and think we should be examining that claim itself, not the
>>>>>> consequences of it.
>>>>>>
>>>>>> To give a simple example, let's say some new user is trying out
>>> Streams and
>>>>>> wants to just play around
>>>>>> with it to see if it might be worth looking into. They want to just
>>> write
>>>>>> up a simple app and test it out on the
>>>>>> data in some existing topics they have with a large number of
>>> partitions,
>>>>>> and a lot of data. They're just messing
>>>>>> around, trying new topologies and don't want to go through each new
>>> one
>>>>>> step by step to determine if (or where)
>>>>>> a repartition might be required. They also don't want to force a
>>>>>> repartition if it turns out to not be required, so they'd
>>>>>> like to avoid the nice new .repartition operator they saw. But given
>>> the
>>>>>> huge number of input partitions, they'd like
>>>>>> to rest assured that if a repartition does end up being required
>>> somewhere
>>>>>> during dev, it will not be created with
>>>>>> the same huge number of partitions that their input topic has -- so
>>> they
>>>>>> just pass groupBy a small numPartitions
>>>>>> suggestion.
>>>>>>
>>>>>> I know that's a bit of a contrived example but I think it does
>>> highlight
>>>>>> how and when this might be a considerable
>>>>>> quality of life improvement, in particular for new users to Streams
>>> and/or
>>>>>> during the dev cycle. *You don't want to*
>>>>>> *force a repartition if it wasn't necessary, but you don't want to
>>> create a
>>>>>> topic with a huge partition count either.*
>>>>>>
>>>>>> Also, while the optimization discussion took us down an interesting
>>> but
>>>>>> ultimately more distracting road, it's worth
>>>>>> pointing out that it is clearly a major win to have as few
>>>>>> repartition topics/steps as possible. Given that we
>>>>>> don't want to change existing behavior, the optimization framework
>>> can only
>>>>>> help out when the placement of
>>>>>> repartition steps is flexible, which means only those from .groupBy
>>> (and
>>>>>> not .repartition). *Users should not*
>>>>>> *have to choose between allowing Streams to optimize the repartition
>>>>>> placement, and allowing to specify a *
>>>>>> *number of partitions.*
>>>>>>
>>>>>> Lastly, I have what may be a stupid question but for my own
>>> edification of
>>>>>> how groupBy works:
>>>>>> if you do a .groupBy and a repartition is NOT required, does it ever
>>> need
>>>>>> to serialize/deserialize
>>>>>> any of the data? In other words, if you pass a key/value serde to
>>> groupBy
>>>>>> and it doesn't trigger
>>>>>> a repartition, is the serde(s) just ignored and thus more like a
>>> suggestion
>>>>>> than a requirement?
>>>>>>
>>>>>> So again, I don't want to hold up this KIP forever but I feel we've
>>> spent
>>>>>> some time getting slightly
>>>>>> off track (although certainly into very interesting discussions) yet
>>> never
>>>>>> really addressed or questioned
>>>>>> the basic premise: *could a user want to specify a number of
>>> partitions but
>>>>>> not enforce a repartition (at that*
>>>>>> *specific point in the topology)?*
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <
>>> matthias@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Side remark:
>>>>>>>
>>>>>>> If the user specifies `repartition()` on both side of the join, we
>>> can
>>>>>>> actually throw the execption earlier, ie, when we build the topology.
>>>>>>>
>>>>>>> Current, we can do this check only after Kafka Streams was started,
>>>>>>> within `StreamPartitionAssignor#assign()` -- we still need to keep
>>> this
>>>>>>> check for the case that none or only one side has a user specified
>>>>>>> number of partitions though.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 11/14/19 8:15 AM, John Roesler wrote:
>>>>>>>> Thanks, all,
>>>>>>>>
>>>>>>>> I can get behind just totally leaving out reparation-via-groupBy. If
>>>>>>>> we only introduce `repartition()` for now, we're making the minimal
>>>>>>>> change to gain the desired capability.
>>>>>>>>
>>>>>>>> Plus, since we agree that `repartition()` should never be
>>> optimizable,
>>>>>>>> it's a future-compatible proposal. I.e., if we were to add a
>>>>>>>> non-optimizable groupBy(partitions) operation now, and want to make
>>> it
>>>>>>>> optimizable in the future, we have to worry about topology
>>>>>>>> compatibility. Better to just do non-optimizable `repartition()`
>>> now,
>>>>>>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
>>>>>>>>
>>>>>>>> About joins, yes, it's a concern, and IMO we should just do the same
>>>>>>>> thing we do now... check at runtime that the partition counts on
>>> both
>>>>>>>> sides match and throw an exception otherwise. What this means as a
>>>>>>>> user is that if you explicitly repartition the left side to 100
>>>>>>>> partitions, and then join with the right side at 10 partitions, you
>>>>>>>> get an exception, since this operation is not possible. You'd either
>>>>>>>> have to "step down" the left side again, back to 10 partitions, or
>>> you
>>>>>>>> could repartition the right side to 100 partitions before the join.
>>>>>>>> The choice has to be the user's, since it depends on their desired
>>>>>>>> execution parallelism.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <
>>> matthias@confluent.io>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks a lot John. I think the way you decompose the operators is
>>> super
>>>>>>>>> helpful for this discussion.
>>>>>>>>>
>>>>>>>>> What you suggest with regard to using `Grouped` and enforcing
>>>>>>>>> repartitioning if the number of partitions is specified is
>>> certainly
>>>>>>>>> possible. However, I am not sure if we _should_ do this. My
>>> reasoning is
>>>>>>>>> that an enforce repartitioning as introduced via `repartition()`
>>> is an
>>>>>>>>> expensive operations, and it seems better to demand an more
>>> explicit
>>>>>>>>> user opt-in to trigger it. Just setting an optional parameter
>>> might be
>>>>>>>>> too subtle to trigger such a heavy "side effect".
>>>>>>>>>
>>>>>>>>> While I agree about "usability" in general, I would prefer a more
>>>>>>>>> conservative appraoch to introduce this feature, see how it goes,
>>> and
>>>>>>>>> maybe make it more advance later on. This also applies to what
>>>>>>>>> optimzation we may or may not allow (or are able to perform at
>>> all).
>>>>>>>>>
>>>>>>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
>>>>>>>>> Grouped`, I agree that it might not be a good idea.
>>>>>>>>>
>>>>>>>>> Atm, I see an enforces repartitioning as non-optimizable and as a
>>> good
>>>>>>>>> first step and I would suggest to not intoruce anything else for
>>> now.
>>>>>>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
>>>>>>>>> Grouped)` is something we could add later.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Therefore, I would not change `Grouped` but only introduce
>>>>>>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in
>>> to
>>>>>>>>> set the number of partitions, would need to rewrite their code to
>>>>>>>>> `selectKey(...).repartition(...).groupByKey()`. It's less
>>> convinient but
>>>>>>>>> also less risky from an API and optimization point of view.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> @Levani: about joins -> yes, we will need to check the specified
>>> number
>>>>>>>>> of partitions (if any) and if they don't match, throw an
>>> exception. We
>>>>>>>>> can discuss this on the PR -- I am just trying to get the PR for
>>> KIP-466
>>>>>>>>> merged -- your is next on the list :)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>>>>>>>>> Thank you all for an interesting discussion. This is very
>>> enlightening.
>>>>>>>>>>
>>>>>>>>>> Thank you Matthias for your explanation. Your arguments are very
>>> true.
>>>>>>> It makes sense that if user specifies number of partitions he/she
>>> really
>>>>>>> cares that those specifications are applied to internal topics.
>>>>>>>>>> Unfortunately, in current implementation this is not true during
>>>>>>> `join` operation. As I’ve written in the PR comment, currently, when
>>>>>>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max
>>> number of
>>>>>>> partitions from the two source topics.
>>>>>>>>>> I’m not really sure what would be the other way around this
>>> situation.
>>>>>>> Maybe fail the stream altogether and inform the user to specify same
>>> number
>>>>>>> of partitions?
>>>>>>>>>> Or we should treat join operations in a same way as it is right
>>> now
>>>>>>> and basically choose max number of partitions even when `repartition`
>>>>>>> operation is specified, because Kafka Streams “knows the best” how to
>>>>>>> handle joins?
>>>>>>>>>> You can check integration tests how it’s being handled currently.
>>> Open
>>>>>>> to suggestions on that part.
>>>>>>>>>>
>>>>>>>>>> As for groupBy, I agree and John raised very interesting points.
>>> My
>>>>>>> arguments for allowing users to specify number of partitions during
>>> groupBy
>>>>>>> operations mainly was coming from the usability perspective.
>>>>>>>>>> So building on top of what John said, maybe it makes sense to make
>>>>>>> `groupBy` operations smarter and whenever user specifies
>>>>>>> `numberOfPartitions` configuration, repartitioning will be enforced,
>>> wdyt?
>>>>>>>>>> I’m not going into optimization part yet :) I think it will be
>>> part of
>>>>>>> separate PR and task, but overall it makes sense to apply
>>> optimizations
>>>>>>> where number of partitions are the same.
>>>>>>>>>>
>>>>>>>>>> As for Repartitioned extending Grouped, I kinda feel that it
>>> won’t fit
>>>>>>> nicely in current API design.
>>>>>>>>>> In addition, in the PR review, John mentioned that there were a
>>> lot of
>>>>>>> troubles in the past trying to use one operation's configuration
>>> objects on
>>>>>>> other operations.
>>>>>>>>>> Also it makes sense to keep them separate in terms of
>>> compatibility.
>>>>>>>>>> In that case, we don’t have to worry every time Grouped is
>>> changed,
>>>>>>> what would be the implications on `repartition` operations.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io>
>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ah, thanks for the clarification. I missed your point.
>>>>>>>>>>>
>>>>>>>>>>> I like the framework you've presented. It does seem simpler to
>>> assume
>>>>>>>>>>> that they either care about the partition count and want to
>>>>>>>>>>> repartition to realize it, or they don't care about the number.
>>>>>>>>>>> Returning to this discussion, it does seem unlikely that they
>>> care
>>>>>>>>>>> about the number and _don't_ care if it actually gets realized.
>>>>>>>>>>>
>>>>>>>>>>> But then, it still seems like we can just keep the option as
>>> part of
>>>>>>>>>>> Grouped. As in:
>>>>>>>>>>>
>>>>>>>>>>> // user does not care
>>>>>>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>>>>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>>>>>>>>
>>>>>>>>>>> // user does care
>>>>>>>>>>> stream.repartition(Repartitioned)
>>>>>>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>>>>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>>
>>>>>>>>>>> The above discussion got me thinking about algebra. Matthias is
>>>>>>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent
>>> to
>>>>>>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not
>>> convinced that
>>>>>>>>>>> we should force people to apply that expansion themselves vs.
>>> having a
>>>>>>>>>>> more compact way to express it if they don't care where exactly
>>> the
>>>>>>>>>>> repartition occurs. However, thinking about these operators
>>>>>>>>>>> algebraically can really help *us* narrow down the number of
>>> different
>>>>>>>>>>> expressions we have to consider.
>>>>>>>>>>>
>>>>>>>>>>> Let's consider some identities:
>>>>>>>>>>>
>>>>>>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>>>>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>>>>>>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>>>>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>>>>>>>>> D: repartition = sink(managed) + src
>>>>>>>>>>>
>>>>>>>>>>> In these identities, I used one special identifier (...), which
>>> means
>>>>>>>>>>> any number (0+) of operations that are not src, mapKey,
>>> groupBy[Key],
>>>>>>>>>>> repartition, or agg.
>>>>>>>>>>>
>>>>>>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>>>>>>>>> operations are not executable. In other words, you have to get
>>> to a
>>>>>>>>>>> point where you can apply B to convert a groupByKey into a
>>> passthough
>>>>>>>>>>> in order to execute the program. This is just a formal way of
>>> stating
>>>>>>>>>>> what already happens in Kafka Streams.
>>>>>>>>>>>
>>>>>>>>>>> By applying A, we can just completely leave `groupBy` out of our
>>>>>>>>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>>>>>>>>> groupByKey.
>>>>>>>>>>>
>>>>>>>>>>> Then, we can eliminate the "repartition required" case of
>>> `groupByKey`
>>>>>>>>>>> by applying C followed by D to get to the "no repartition
>>> required"
>>>>>>>>>>> version of groupByKey, which in turn sets us up to apply B to
>>> get an
>>>>>>>>>>> executable topology.
>>>>>>>>>>>
>>>>>>>>>>> Fundamentally, you can think about KIP-221 is as proposing a
>>> modified
>>>>>>>>>>> D identity in which you can specify the partition count of the
>>> managed
>>>>>>>>>>> sink topic:
>>>>>>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>>>>>>>>
>>>>>>>>>>> Since users _could_ apply the identities above, we don't
>>> actually have
>>>>>>>>>>> to add any partition count to groupBy[Key], but we decided early
>>> on in
>>>>>>>>>>> the KIP discussion that it's more ergonomic to add it. In that
>>> case,
>>>>>>>>>>> we also have to modify A and C:
>>>>>>>>>>> A': groupBy(mapper, pc) + agg
>>>>>>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
>>>>>>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>>>>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>>>>>>>>
>>>>>>>>>>> Which sets us up still to always be able to get back to a plain
>>>>>>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>>>>>>>>> ultimately B to get an executable topology.
>>>>>>>>>>>
>>>>>>>>>>> What about the optimizer?
>>>>>>>>>>> The optimizer applies another set of graph-algebraic identities
>>> to
>>>>>>>>>>> minimize the number of repartition topics in a topology.
>>>>>>>>>>>
>>>>>>>>>>> (forgive my ascii art)
>>>>>>>>>>>
>>>>>>>>>>> E: (merging repartition nodes)
>>>>>>>>>>> (...) -> repartition -> X
>>>>>>>>>>> \-> repartition -> Y
>>>>>>>>>>> =
>>>>>>>>>>> (... + repartition) -> X
>>>>>>>>>>>    \-> Y
>>>>>>>>>>> F: (reordering around repartition)
>>>>>>>>>>> Where SVO is any non-key-changing, stateless, operation:
>>>>>>>>>>> repartition -> SVO = SVO -> repartition
>>>>>>>>>>>
>>>>>>>>>>> In terms of these identities, what the optimizer does is apply F
>>>>>>>>>>> repeatedly in either direction to a topology to factor out
>>> common in
>>>>>>>>>>> branches so that it can apply E to merge repartition nodes. This
>>> was
>>>>>>>>>>> especially necessary before KIP-221 because you couldn't directly
>>>>>>>>>>> express `repartition` in the DSL, only indirectly via
>>> `groupBy[Key]`,
>>>>>>>>>>> so there was no way to do the factoring manually.
>>>>>>>>>>>
>>>>>>>>>>> We can now state very clearly that in KIP-221, explicit
>>>>>>>>>>> `repartition()` operators should create a "reordering barrier".
>>> So, F
>>>>>>>>>>> cannot be applied to an explicit `repartition()`. Also, I think
>>> we
>>>>>>>>>>> decided earlier that explicit `repartition()` operations would
>>> also be
>>>>>>>>>>> ineligible for merging, so E can't be applied to explicit
>>>>>>>>>>> `repartition()` operations either. I think we feel we _could_
>>> apply E
>>>>>>>>>>> without harm, but we want to be conservative for now.
>>>>>>>>>>>
>>>>>>>>>>> I think the salient point from the latter discussion has been
>>> that
>>>>>>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_
>>> constitute
>>>>>>>>>>> an explicit `repartition()` operator, and therefore, the
>>> resulting
>>>>>>>>>>> repartition node remains eligible for optimization.
>>>>>>>>>>>
>>>>>>>>>>> To be clear, I agree with Matthias that the provided partition
>>> count
>>>>>>>>>>> _must_ be used in the resulting implicit repartition. This has
>>> some
>>>>>>>>>>> implications for E. Namely, E could only be applied to two
>>> repartition
>>>>>>>>>>> nodes that have the same partition count. This has always been
>>>>>>>>>>> trivially true before KIP-221 because the partition count has
>>> always
>>>>>>>>>>> been "unspecified", i.e., it would be determined at runtime by
>>> the
>>>>>>>>>>> user-managed-topics' partition counts. Now, it could be
>>> specified or
>>>>>>>>>>> unspecified. We can simply augment E to allow merging only
>>> repartition
>>>>>>>>>>> nodes where the partition count is EITHER "specified and the
>>> same on
>>>>>>>>>>> both sides", OR "unspecified on both sides".
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the long email, but I have a hope that it builds a
>>> solid
>>>>>>>>>>> theoretical foundation for our decisions in KIP-221, so we can
>>> have
>>>>>>>>>>> confidence that there are no edge cases for design flaws to hide.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
>>>>>>> matthias@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> it seems like we do want to allow
>>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>> repartitioning
>>>>>>>>>>>>
>>>>>>>>>>>> Correct, ie, that is my suggestions.
>>>>>>>>>>>>
>>>>>>>>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>>>>>>>>
>>>>>>>>>>>> I disagree here, because my reasoning is that:
>>>>>>>>>>>>
>>>>>>>>>>>> - if a user cares about the number of partition, the user wants
>>> to
>>>>>>>>>>>> enforce a repartitioning
>>>>>>>>>>>> - if a user does not case about the number of partitions, we
>>> don't
>>>>>>> need
>>>>>>>>>>>> to provide them a way to pass in a "hint"
>>>>>>>>>>>>
>>>>>>>>>>>> Hence, it should be sufficient to support:
>>>>>>>>>>>>
>>>>>>>>>>>> // user does not care
>>>>>>>>>>>>
>>>>>>>>>>>> `stream.groupByKey(Grouped)`
>>>>>>>>>>>> `stream.grouBy(..., Grouped)`
>>>>>>>>>>>>
>>>>>>>>>>>> // user does care
>>>>>>>>>>>>
>>>>>>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
>>>>>>>>>>>> `streams.groupBy(..., Repartitioned)`
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>>>>>>>>> Thanks for those thoughts, Matthias,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I find your reasoning about the optimization behavior
>>> compelling.
>>>>>>> The
>>>>>>>>>>>>> `through` operation is very simple and clear to reason about.
>>> It
>>>>>>> just
>>>>>>>>>>>>> passes the data exactly at the specified point in the topology
>>>>>>> exactly
>>>>>>>>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>>>>>>>>> `repartition` operator, the simplest behavior is if we just do
>>> what
>>>>>>>>>>>>> they asked for.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stepping back to think about when optimizations are surprising
>>> and
>>>>>>>>>>>>> when they aren't, it occurs to me that we should be free to
>>> move
>>>>>>>>>>>>> around repartitions when users have asked to perform some
>>> operation
>>>>>>>>>>>>> that implies a repartition, like "change keys, then filter,
>>> then
>>>>>>>>>>>>> aggregate". This program requires a repartition, but it could
>>> be
>>>>>>>>>>>>> anywhere between the key change and the aggregation. On the
>>> other
>>>>>>>>>>>>> hand, if they say, "change keys, then filter, then
>>> repartition, then
>>>>>>>>>>>>> aggregate", it seems like they were pretty clear about their
>>> desire,
>>>>>>>>>>>>> and we should just take it at face value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, I'm sold on just literally doing a repartition every time
>>> they
>>>>>>>>>>>>> invoke the `repartition` operator.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is
>>> more
>>>>>>> nuanced.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What you said about `groupByKey` makes sense. If they key
>>> hasn't
>>>>>>>>>>>>> actually changed, then we don't need to repartition before
>>>>>>>>>>>>> aggregating. On the other hand, `groupBy` is specifically
>>> changing
>>>>>>> the
>>>>>>>>>>>>> key as part of the grouping operation, so (as you said) we
>>>>>>> definitely
>>>>>>>>>>>>> have to do a repartition.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I'm reading the discussion right, it seems like we do want
>>> to
>>>>>>> allow
>>>>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>>>>> operation, but we don't want that option to _force_
>>> repartitioning
>>>>>>> if
>>>>>>>>>>>>> it's not needed. That last clause is the key. "Use P
>>> partitions if
>>>>>>>>>>>>> repartitioning is necessary" is a directive that applies
>>> cleanly and
>>>>>>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call
>>> the
>>>>>>>>>>>>> option `numberOfPartitionsHint`, which along with the "if
>>> necessary"
>>>>>>>>>>>>> javadoc, should make it clear that the option won't force a
>>>>>>>>>>>>> repartition, and also gives us enough latitude to still employ
>>> the
>>>>>>>>>>>>> optimizer on those repartition topics?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we like the idea of expressing it as a "hint" for grouping
>>> and a
>>>>>>>>>>>>> "command" for `repartition`, then it seems like it still makes
>>> sense
>>>>>>>>>>>>> to keep Grouped and Repartitioned separate, as they would
>>> actually
>>>>>>>>>>>>> offer different methods with distinct semantics.
>>>>>>>>>>>>>
>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
>>>>>>> matthias@confluent.io> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for late reply.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>>>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using
>>> the
>>>>>>>>>>>>>> upstream topic to determine the number of partitions)
>>>>>>>>>>>>>> - KS does auto-repartitioning only for downstream DSL
>>> operators
>>>>>>> like
>>>>>>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
>>>>>>> auto-repartitioning
>>>>>>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>>>>>>>>> - KS offers `through()` to enforce a repartitioning --
>>> however,
>>>>>>> the user
>>>>>>>>>>>>>> needs to create the topic manually (with the desired number of
>>>>>>> partitions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see two main applications for `repartitioning()`:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) repartition data before a `transform()` but user does not
>>> want
>>>>>>> to
>>>>>>>>>>>>>> manage the topic
>>>>>>>>>>>>>> 2) scale out a downstream subtopology
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a
>>> users
>>>>>>> calls
>>>>>>>>>>>>>> it, a repartitining is enforced, with the difference that KS
>>>>>>> manages the
>>>>>>>>>>>>>> topic and the user does not need to create it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>> repartition
>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Because of (2), I disagree. Either a user does not care about
>>>>>>> scaling
>>>>>>>>>>>>>> out, for which case she would not specify the number of
>>>>>>> partitions. Or a
>>>>>>>>>>>>>> user does care, and hence wants to enforce the scale out. I
>>> don't
>>>>>>> think
>>>>>>>>>>>>>> that any user would say, "maybe scale out".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Therefore, the optimizer should never ignore the repartition
>>>>>>> operation.
>>>>>>>>>>>>>> As a "consequence" (because repartitioning is expensive) a
>>> user
>>>>>>> should
>>>>>>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking
>>> an
>>>>>>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be
>>> "dangerous"
>>>>>>>>>>>>>> because it might be too subtle and an "optional scaling out"
>>> as
>>>>>>> laid out
>>>>>>>>>>>>>> above does not make sense IMHO.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am also not worried about "over repartitioning" because the
>>>>>>> result
>>>>>>>>>>>>>> stream would never trigger auto-repartitioning. Only if
>>> multiple
>>>>>>>>>>>>>> consecutive calls to `repartition()` are made it could be bad
>>> --
>>>>>>> but
>>>>>>>>>>>>>> that's the same with `through()`. In the end, there is always
>>> some
>>>>>>>>>>>>>> responsibility on the user.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be
>>> required,
>>>>>>>>>>>>>> however, for `groupByKey()` it depends if the KStream is
>>> marked as
>>>>>>>>>>>>>> `repartitioningRequired`.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a
>>> user to
>>>>>>> set
>>>>>>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different
>>> story,
>>>>>>>>>>>>>> because calling
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  `repartition().groupBy()`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> does not achieve what we want. Hence, allowing users to pass
>>> in the
>>>>>>>>>>>>>> number of users partitions into `groupBy()` does actually
>>> makes
>>>>>>> sense,
>>>>>>>>>>>>>> because repartitioning will happen anyway and thus we can
>>>>>>> piggyback a
>>>>>>>>>>>>>> scaling decision.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think that John has a fair concern about the overloads,
>>> however,
>>>>>>> I am
>>>>>>>>>>>>>> not convinced that using `Grouped` to specify the number of
>>>>>>> partitions
>>>>>>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned`
>>> and
>>>>>>> both
>>>>>>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I
>>> am
>>>>>>>>>>>>>> wondering if we could bridge the gap between both, if we
>>> would make
>>>>>>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what
>>> seems to
>>>>>>> make
>>>>>>>>>>>>>> the API quite smooth:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `stream.groupBy(..., Grouped...)`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>> Hi Sophie,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
>>>>>>> hearing others opinion as well on this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
>>>>>>> sophie@confluent.io> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>>>>>> other hand
>>>>>>>>>>>>>>>> Kafka Streams has already
>>>>>>>>>>>>>>>>> optimizer in place which alters topology independently
>>> from user
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree (with you) and think this is a good way to put it
>>> -- we
>>>>>>> currently
>>>>>>>>>>>>>>>> auto-repartition for the user so
>>>>>>>>>>>>>>>> that they don't have to walk through their entire topology
>>> and
>>>>>>> reason about
>>>>>>>>>>>>>>>> when and where to place a
>>>>>>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly
>>> force
>>>>>>> this onto the
>>>>>>>>>>>>>>>> user? How certain are we that
>>>>>>>>>>>>>>>> users will always get this right? It's easy to imagine that
>>>>>>> during
>>>>>>>>>>>>>>>> development, you write your new app with
>>>>>>>>>>>>>>>> correctly placed repartitions in order to use this new
>>> feature.
>>>>>>> During the
>>>>>>>>>>>>>>>> course of development you end up
>>>>>>>>>>>>>>>> tweaking the topology, but don't remember to review or move
>>> the
>>>>>>>>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>>>>>>>>> doing this for you. If you use only single-partition topics
>>> for
>>>>>>> testing,
>>>>>>>>>>>>>>>> you might not even notice your app is
>>>>>>>>>>>>>>>> spitting out incorrect results!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
>>>>>>> introduce a new
>>>>>>>>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>>>>>>>>> advantage of this other feature anymore. Also, is it
>>> possible our
>>>>>>>>>>>>>>>> optimization framework could ever include an
>>>>>>>>>>>>>>>> optimized repartitioning strategy that is better than what a
>>>>>>> user could
>>>>>>>>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>>>>>>>>> Do we expect users to have a deep understanding of the best
>>> way
>>>>>>> to
>>>>>>>>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>>>>>>>>> likely they will end up over-repartitioning either due to
>>> missed
>>>>>>>>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>>>>>> repartition
>>>>>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped
>>> rather
>>>>>>> than
>>>>>>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in
>>> line
>>>>>>> with the
>>>>>>>>>>>>>>>> current syntax so +1 from me
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under
>>> review and
>>>>>>> it’s
>>>>>>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP
>>> to
>>>>>>> address
>>>>>>>>>>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
>>>>>>> control over
>>>>>>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is
>>> done
>>>>>>> in
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which
>>> is
>>>>>>> planned to
>>>>>>>>>>>>>>>>> be separate PR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While all agree about general implementation and idea
>>> behind
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR,
>>> introducing new
>>>>>>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
>>>>>>> questions during
>>>>>>>>>>>>>>>>> the review.
>>>>>>>>>>>>>>>>> Matthias raised concern that there can be cases when user
>>> uses
>>>>>>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
>>>>>>> repartitioning may
>>>>>>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
>>>>>>> would never be
>>>>>>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted
>>> your
>>>>>>> comment).
>>>>>>>>>>>>>>>>> So instead, if user wants to control parallelism of
>>>>>>> sub-topologies, he or
>>>>>>>>>>>>>>>>> she should always use `KStream#repartition` operation
>>> before
>>>>>>> groupBy. Full
>>>>>>>>>>>>>>>>> comment can be seen here:
>>>>>>>>>>>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>>>>>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On the same topic, John pointed out that, from API design
>>>>>>> perspective, we
>>>>>>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
>>>>>>> operators between
>>>>>>>>>>>>>>>>> one another. So instead of introducing new
>>>>>>> `KStream#groupBy(Repartitioned)`
>>>>>>>>>>>>>>>>> for specifying number of partitions for internal topic, we
>>>>>>> should update
>>>>>>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>>>>>> other hand
>>>>>>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
>>>>>>> topology
>>>>>>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
>>>>>>> Streams,
>>>>>>>>>>>>>>>>> internally would optimize topology in the best way
>>> possible,
>>>>>>> even if in
>>>>>>>>>>>>>>>>> some cases this means ignoring some operator configurations
>>>>>>> passed by the
>>>>>>>>>>>>>>>>> user. Also, I agree with John about API design semantics.
>>> If we
>>>>>>> go through
>>>>>>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes
>>> more
>>>>>>> sense to
>>>>>>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead
>>> of
>>>>>>> introducing
>>>>>>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
>>>>>>> sophie@confluent.io>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
>>>>>>> don't have
>>>>>>>>>>>>>>>>> much
>>>>>>>>>>>>>>>>>> spare time at the
>>>>>>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
>>>>>>> things, but
>>>>>>>>>>>>>>>>> feel
>>>>>>>>>>>>>>>>>> free to pick up something else
>>>>>>>>>>>>>>>>>> to work on in the meantime until things have calmed down
>>> a bit!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like
>>> to get
>>>>>>> some
>>>>>>>>>>>>>>>>>>> attention on this PR:
>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>>>>>>>>>>> It's been a while now and I would love to move on to
>>> other
>>>>>>> KIPs as well.
>>>>>>>>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>>>>>>>>>
>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>>>>>>
>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>>>>>>
>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion
>>> on
>>>>>>> that one.
>>>>>>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a
>>> good
>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>> <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
>>> matthias@confluent.io>>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
>>>>>>> numberOfPartitions)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
>>>>>>> `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>>>>>>>>> combination with `withName()` and
>>>>>>> `withNumberOfPartitions()` should
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
>>>>>>> small if
>>>>>>>>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re
>>> any
>>>>>>> other
>>>>>>>>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
>>> matthias@confluent.io>>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>> <mailto:
>>>>>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
>>>>>>> producer
>>>>>>>>>>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
>>>>>>> "extension" to
>>>>>>>>>>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to
>>> think
>>>>>>> about it and
>>>>>>>>>>>>>>>>>>> helps
>>>>>>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to
>>> avoid
>>>>>>>>>>>>>>>>> abbreviations.
>>>>>>>>>>>>>>>>>>> Can
>>>>>>>>>>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more
>>> `static`
>>>>>>> methods:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>>>>>>>>> -
>>> Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it
>>> makes
>>>>>>> sense to
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>> some duplicate configurations between operators in order
>>> to
>>>>>>> keep API
>>>>>>>>>>>>>>>>> simple
>>>>>>>>>>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to
>>> have
>>>>>>> proper
>>>>>>>>>>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
>>>>>>> separate was
>>>>>>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
>>>>>>> producer level
>>>>>>>>>>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman
>>> <
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>> <mailto:
>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>> <mailto:
>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
>>>>>>> producer level
>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
>>>>>>> considerations
>>>>>>>>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
>>>>>>> level
>>>>>>>>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just
>>> as
>>>>>>> much a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have
>>> two
>>>>>>> producers
>>>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
>>>>>>> they are
>>>>>>>>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
>>>>>>> part. So in a
>>>>>>>>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
>>>>>>> producers.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams
>>> API is
>>>>>>> not always
>>>>>>>>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
>>>>>>> operator are
>>>>>>>>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We
>>> could
>>>>>>> instead
>>>>>>>>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
>>>>>>> parameter to keep
>>>>>>>>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
>>>>>>> increases the
>>>>>>>>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to
>>> just
>>>>>>> say "this
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
>>>>>>> without worrying
>>>>>>>>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
>>>>>>> field of
>>>>>>>>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
>>>>>>> elsewhere in
>>>>>>>>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class
>>> for
>>>>>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer
>>> level
>>>>>>>>>>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in
>>> this
>>>>>>> thread).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
>>>>>>> interface, I
>>>>>>>>>>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
>>>>>>> configuration
>>>>>>>>>>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda
>>> breaking
>>>>>>> the
>>>>>>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
>>>>>>> encapsulates topic
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal
>>> topics is
>>>>>>> viable
>>>>>>>>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
>>>>>>> bbejeck@gmail.com
>>>>>>>>>>>>>>>>> <ma...@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.
>>> In
>>>>>>> addition to
>>>>>>>>>>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
>>>>>>> hash-based
>>>>>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
>>>>>>> different than the
>>>>>>>>>>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
>>>>>>> conjunction with
>>>>>>>>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
>>>>>>> Produced hence
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes
>>> sense
>>>>>>> for API
>>>>>>>>>>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
>>>>>>> Repartitioned class,
>>>>>>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of
>>> partitions for
>>>>>>> internal
>>>>>>>>>>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure
>>> co-partitioning
>>>>>>> before join
>>>>>>>>>>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind.
>>> Again,
>>>>>>> not totally
>>>>>>>>>>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar
>>> is
>>>>>>> the
>>>>>>>>>>>>>>>>> rationale
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani
>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your
>>> example
>>>>>>> I would
>>>>>>>>>>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper
>>> partitioner
>>>>>>> that
>>>>>>>>>>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with
>>> and
>>>>>>> then do the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler
>>> <
>>>>>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> john@confluent.io
>>>>>>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io
>>> <mailto:
>>>>>>>>>>>>>>>>> john@confluent.io>
>>>>>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an
>>> option to
>>>>>>> specify
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is
>>> that
>>>>>>> some data may
>>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
>>>>>>> topic. If the
>>>>>>>>>>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
>>>>>>> strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
>>>>>>> partitioned with the
>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
>>>>>>> something
>>>>>>>>>>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
>>>>>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To
>>> be
>>>>>>> honest I’m
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
>>>>>>> think I would
>>>>>>>>>>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka
>>> Streams
>>>>>>> knows best
>>>>>>>>>>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should
>>> be
>>>>>>> used only for
>>>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet,
>>> but I
>>>>>>> guess
>>>>>>>>>>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
>>>>>>> agreement on
>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
>>>>>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
>>> sophie@confluent.io>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
>>>>>>> Repartitioned
>>>>>>>>>>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also
>>> need
>>>>>>> to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field,
>>> similar
>>>>>>> to
>>>>>>>>>>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but
>>> seems
>>>>>>> it should
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by
>>> Kafka
>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user
>>> more
>>>>>>> control over
>>>>>>>>>>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
>>>>>>> something that
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
>>>>>>> parallelism by
>>>>>>>>>>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
>>>>>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
>>> sophie@confluent.io>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
>>>>>>> thing for me
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems
>>> to be
>>>>>>> the case
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it
>>> have to
>>>>>>> be
>>>>>>>>>>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version
>>> of
>>>>>>> the method
>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>> <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing
>>> list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
>>>>>>> opinions/concerns
>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward,
>>> I’ve
>>>>>>> updated
>>>>>>>>>>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected
>>> Alternatives”
>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP
>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
>>>>>>> dedicated `Topic`
>>>>>>>>>>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators
>>> like
>>>>>>>>>>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others
>>> opinion
>>>>>>> about this
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM,
>>> Matthias
>>>>>>> J. Sax <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP!
>>> And
>>>>>>> thanks for
>>>>>>>>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
>>>>>>> discussed
>>>>>>>>>>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a
>>> good
>>>>>>> summary to
>>>>>>>>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes
>>> sense.
>>>>>>> With regard
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage
>>> topics
>>>>>>> and
>>>>>>>>>>>>>>>>> operators
>>>>>>>>>>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument,
>>> it
>>>>>>> might
>>>>>>>>>>>>>>>>> indicate
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator
>>> that
>>>>>>> uses
>>>>>>>>>>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
>>>>>>> operator (an
>>>>>>>>>>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
>>>>>>> Otherwise, there is
>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
>>>>>>> topics but
>>>>>>>>>>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make
>>> the
>>>>>>> distinction
>>>>>>>>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
>>>>>>> `Grouped`. I am
>>>>>>>>>>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does
>>> apply
>>>>>>> and adding
>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
>>>>>>> opinions of
>>>>>>>>>>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I
>>> prefer
>>>>>>> atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration
>>> objects
>>>>>>> that allow
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer,
>>> producer,
>>>>>>> store. If we
>>>>>>>>>>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if
>>> we
>>>>>>> should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()`
>>> class
>>>>>>> and method
>>>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to
>>> add
>>>>>>> other
>>>>>>>>>>>>>>>>> configs,
>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc,
>>> easily,
>>>>>>> without the
>>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not
>>> sure
>>>>>>> if I like
>>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani
>>> Kokhreidze
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
>>>>>>> maybe
>>>>>>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
>>>>>>> best approach.
>>>>>>>>>>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class
>>> with
>>>>>>> this
>>>>>>>>>>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation.
>>> Since
>>>>>>> KStream#to is
>>>>>>>>>>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be
>>> reasonable
>>>>>>>>>>>>>>>>> assumption
>>>>>>>>>>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in
>>> advance. And
>>>>>>> in that
>>>>>>>>>>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make
>>> much
>>>>>>> sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class,
>>> based
>>>>>>> on API
>>>>>>>>>>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something
>>> that
>>>>>>> is
>>>>>>>>>>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer,
>>> partitioner
>>>>>>> those all
>>>>>>>>>>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions
>>> is
>>>>>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer
>>> level
>>>>>>>>>>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
>>>>>>> seems like
>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with
>>> non-internal
>>>>>>> (e.g
>>>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics
>>> and I
>>>>>>> think we
>>>>>>>>>>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into
>>> account,
>>>>>>> I think we
>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
>>>>>>> Streams should
>>>>>>>>>>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
>>>>>>> topics that
>>>>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
>>>>>>> numPartitions
>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and
>>> producer
>>>>>>> level
>>>>>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name
>>> optional
>>>>>>> in
>>>>>>>>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and
>>> giving
>>>>>>> users
>>>>>>>>>>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more
>>> useful :)
>>>>>>>>>>>>>>>>> Considering
>>>>>>>>>>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
>>>>>>> class, I see two
>>>>>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>>>>>> auto-generated and
>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>> numOfPartitions)
>>>>>>> - topic
>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>> numOfPartitions,
>>>>>>> final
>>>>>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
>>> generated
>>>>>>> with
>>>>>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>>>>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it
>>> is and
>>>>>>> introduce
>>>>>>>>>>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
>>>>>>> suggested this
>>>>>>>>>>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
>>>>>>> propose the
>>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>>>>>> configuration to
>>>>>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method
>>> overloads to
>>>>>>>>>>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>>>>>> auto-generated and
>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>> numOfPartitions)
>>>>>>> - topic
>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
>>> numOfPartitions,
>>>>>>> final
>>>>>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
>>> generated
>>>>>>> with
>>>>>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>>>>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>>>>>> configuration to
>>>>>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
>>>>>>> KStream#repartition for
>>>>>>>>>>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
>>>>>>> already
>>>>>>>>>>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the
>>> threads
>>>>>>> that were
>>>>>>>>>>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM,
>>> Levani
>>>>>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>>>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect
>>> discussion
>>>>>>> around
>>>>>>>>>>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems
>>> to
>>>>>>> agreement
>>>>>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation,
>>> as far
>>>>>>> as I
>>>>>>>>>>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
>>>>>>> numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
>>>>>>> users to
>>>>>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing
>>> stateful
>>>>>>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
>>>>>>> changes around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by John Roesler <jo...@confluent.io>.
Thanks Sophie,

I think your concern is valid, and also that your idea to make a
ticket is a good idea.

Creating a ticket has some very positive effects:
* It allows us to record the thinking at this point in time so we
don't have to dig through the mail archives later
* It demonstrates that we did consider the use case, and do want to
address it, but just don't feel confident to implement it right now.
Then, if/when people do have a problem with the gap, the ticket it
already there for them to consider, request, or even pick up.

Since one aspect of the deferral is a desire to wait for real use
experience, we should explicitly mention that in the ticket. This is
just good information for people browsing the Jira looking for
interesting tickets to pick up. They could still pick it up, but they
can ask themselves if they really understand the real-world use cases
any better than we do right now.

Thanks, likewise, to you for the good discussion!
-John

On Fri, Nov 15, 2019 at 2:37 PM Sophie Blee-Goldman <so...@confluent.io> wrote:
>
> While I'm concerned that "not augmenting groupBy as part of this KIP"
> really translates to "will not get around to augmenting groupBy for a long
> time if not as part of this KIP", like I said I don't want to hold up the
> new
> .repartition operator that it seems we do, at least, all agree on. It's a
> fair
> point that we can always add this in later, but undoing it is far more
> problematic.
>
> Anyways, I would be happy if we at least make a ticket to consider adding a
> "number of partitions" option/suggestion to groupBy, so that we don't lose
> all the thought put in to this decision so far and can avoid rehashing the
> same
> argument word for word and have something to point to when someone
> asks "why didn't we add this numPartitions option to groupBy".
>
> Beyond that, if the community isn't pushing for it at this moment then it
> seems very
> reasonable to shelve the idea for now so that the rest of this KIP can
> proceed.
> Without input one way or another it's hard to say what the right thing to
> do is,
> which makes the right thing to do "wait to add this feature"
>
> Thanks for the good discussion everyone,
>
> Sophie
>
> On Fri, Nov 15, 2019 at 12:41 PM John Roesler <jo...@confluent.io> wrote:
>
> > Hi all,
> >
> > I think that Sophie is asking a good question, and I do think that
> > such "blanket configurations" are plausible. For example, we currently
> > support (and I would encourage) "I don't know if this is going to
> > create a repartition topic, but if it does, then use this name instead
> > of generating one".
> >
> > I'm not sure I'm convinced that specifying max parallelism falls into
> > this category. After all, the groupByKey+aggregate will be executed
> > with _some_ max parallelism. It's either the same as the inputs'
> > partition count or overridden with the proposed config. It seems
> > counterintuitive to override the specified option with the default
> > value.
> >
> > I'm not sure if I can put my finger on it, but "maybe use this name"
> > seems way more reasonable to me than "maybe execute with this degree
> > of parallelism".
> >
> > I do think (and I appreciate that this is where Sophie's example is
> > coming from) that Streams should strive to be absolutely as simple and
> > intuitive as possible (while still maintaining correctness). Optimal
> > performance can be at odds with API simplicity. For example, the
> > simplest behavior is, if you ask for 5 partitions, you get 5
> > partitions. Maybe a repartition is technically not necessary (if you
> > didn't change the key), but at least there's no mystery to this
> > behavior.
> >
> > Clearly, an (opposing) tenent of simplicity is trying to prevent
> > people from making mistakes, which I think is what the example boils
> > down to. Sometimes, we can prevent clear mistakes, like equi-joining
> > two topics with different partition counts. But for this case, it
> > doesn't seem as clear-cut to be able to assume that they _said_ 5
> > partitions, but they didn't really _want_ 5 partitions. Maybe we can
> > just try to be clear in the documentation, and also even log a warning
> > when we parse the topology, "hey, I've been asked to repartition this
> > stream, but it's not necessary".
> >
> > If anything, this discussion really supports to me the value in just
> > sticking with `repartition()` for now, and deferring
> > `groupBy[Key](partitions)` to the future.
> >
> > > Users should not have to choose between allowing Streams to optimize the
> > repartition placement, and allowing to specify a number of partitions.
> >
> > This is a very fair point, and it may be something that we rapidly
> > return to, but it seems safe for now to introduce the non-optimizable
> > `reparition()` only, and then consider optimization options later.
> > Skipping available optimizations will never break correctness, but
> > adding optimizations can, so it makes sense to treat them with
> > caution.
> >
> > In conclusion, I do think that a use _could_ want to "maybe specify"
> > the partition count, but I also think we can afford to pass on
> > supporting this right now.
> >
> > I'm open to continuing the discussion, but just to avoid ambiguity, I
> > still feel we should _not_ change the groupBy[Key] operation at all,
> > and we should only add `repartition()` as a non-optimizable operation.
> >
> > Thanks all,
> > -John
> >
> > On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
> > <le...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > Just fyi, PR was updated and now it incorporates the latest suggestions
> > about joins.
> > > `CopartitionedTopicsEnforcer` will throw an exception if number of
> > partitions aren’t the same when using `repartition` operation along with
> > `join`.
> > >
> > > For more details please take a look at the PR:
> > https://github.com/apache/kafka/pull/7170/files <
> > https://github.com/apache/kafka/pull/7170/files>
> > >
> > > Regards,
> > > Levani
> > >
> > >
> > > > On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io>
> > wrote:
> > > >
> > > > Thanks a lot for the input Sophie.
> > > >
> > > > Your example is quite useful, and I would use it to support my claim
> > > > that a "partition hint" for `Grouped` seems "useless" and does not
> > > > improve the user experience.
> > > >
> > > > 1) You argue that a new user would be worries about repartitions topics
> > > > with too many paritions. This would imply that a user is already
> > > > advanced enough to understand the implication of repartitioning -- for
> > > > this case, I would argue that a user also understand _when_ a
> > > > auto-repartitioning would happen and thus the users understands where
> > to
> > > > insert a `repartition()` operation.
> > > >
> > > > 2) For specifying Serdes: if a `groupByKey()` does not trigger
> > > > auto-repartitioning it's not required to specify the serdes and if they
> > > > are specified they would be ignored/unused (note, that `groupBy()`
> > would
> > > > always trigger a repartitioning). Of course, if the default Serdes from
> > > > the config match (eg, all data types are Json anyway), a user does not
> > > > need to worry about specifying serdes. -- For new user that play
> > around,
> > > > I would assume that they work a lot with primitive types and thus would
> > > > need to specify the serdes -- hence, they would learn about
> > > > auto-repartitioning the hard way anyhow, because each time a
> > > > `groupByKey()` does trigger auto-repartioning, they would need to pass
> > > > in the correct Serdes -- this way, they would also be educated where to
> > > > insert a `repartition()` operator if needed.
> > > >
> > > > 3) If a new user really just "plays around", I don't think they use an
> > > > input topic with 100 partitions but most likely have a local single
> > node
> > > > broker with most likely single partitions topics.
> > > >
> > > >
> > > > My main argument for my current proposal is however, that---based on
> > > > past experience---it's better to roll out a new feature more carefully
> > > > and see how it goes. Last, as John pointed out, we can still extend the
> > > > feature in the future. Instead of making a judgment call up-front,
> > being
> > > > more conservative and less fancy, and revisit the design based on
> > > > actuall user feedback after the first version is rolled out, seems to
> > be
> > > > the better option. Undoing a feature is must harder than extending it.
> > > >
> > > >
> > > > While I advocate strong for a simple first version of this feature,
> > it's
> > > > a community decission in the end, and I would not block this KIP if
> > > > there is a broad preference to add `Grouped#withNumberOfPartitions()`
> > > > either.
> > > >
> > > >
> > > > -Matthias
> > > >
> > > > On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
> > > >> It seems like we all agree at this point (please correct me if
> > wrong!) that
> > > >> we should NOT change
> > > >> the existing repartitioning behavior, ie we should allow Streams to
> > > >> continue to determine when and
> > > >> where to repartition -- *unless* explicitly informed to by the use of
> > a
> > > >> .through or the new .repartition operator.
> > > >>
> > > >> Regarding groupBy, the existing behavior we should not disrupt is
> > > >> a) repartition *only* when required due to upstream key-changing
> > operation
> > > >> (ie don't force repartitioning
> > > >> based on the presence of an optional config parameter), and
> > > >> b) allow optimization of required repartitions, if any
> > > >>
> > > >> Within the constraint of not breaking the existing behavior, this
> > still
> > > >> leaves open the question of whether we
> > > >> want to improve the user experience by allowing to provide groupBy
> > with a
> > > >> *suggestion* for numPartitions (or to
> > > >> put it more fairly, whether that *will* improve the experience). I
> > agree
> > > >> with many of the arguments outlined above but
> > > >> let me just push back on this one issue one final time, and if we
> > can't
> > > >> come to a consensus then I am happy to drop
> > > >> it for now so that the KIP can proceed.
> > > >>
> > > >> Specifically, my proposal would be to simply augment Grouped with an
> > > >> optional numPartitions, understood to
> > > >> indicate the user's desired number of partitions *if Streams decides
> > to
> > > >> repartition due to that groupBy*
> > > >>
> > > >>> if a user cares about the number of partition, the user wants to
> > enforce
> > > >> a repartitioning
> > > >> First, I think we should take a step back and examine this claim. I
> > agree
> > > >> 100% that *if this is true,*
> > > >> *then we should not give groupBy an optional numPartitions.* As far
> > as I
> > > >> see it, there's no argument
> > > >> to be had there if we *presuppose that claim.* But I'm not convinced
> > in
> > > >> that as an axiom of the user
> > > >> experience and think we should be examining that claim itself, not the
> > > >> consequences of it.
> > > >>
> > > >> To give a simple example, let's say some new user is trying out
> > Streams and
> > > >> wants to just play around
> > > >> with it to see if it might be worth looking into. They want to just
> > write
> > > >> up a simple app and test it out on the
> > > >> data in some existing topics they have with a large number of
> > partitions,
> > > >> and a lot of data. They're just messing
> > > >> around, trying new topologies and don't want to go through each new
> > one
> > > >> step by step to determine if (or where)
> > > >> a repartition might be required. They also don't want to force a
> > > >> repartition if it turns out to not be required, so they'd
> > > >> like to avoid the nice new .repartition operator they saw. But given
> > the
> > > >> huge number of input partitions, they'd like
> > > >> to rest assured that if a repartition does end up being required
> > somewhere
> > > >> during dev, it will not be created with
> > > >> the same huge number of partitions that their input topic has -- so
> > they
> > > >> just pass groupBy a small numPartitions
> > > >> suggestion.
> > > >>
> > > >> I know that's a bit of a contrived example but I think it does
> > highlight
> > > >> how and when this might be a considerable
> > > >> quality of life improvement, in particular for new users to Streams
> > and/or
> > > >> during the dev cycle. *You don't want to*
> > > >> *force a repartition if it wasn't necessary, but you don't want to
> > create a
> > > >> topic with a huge partition count either.*
> > > >>
> > > >> Also, while the optimization discussion took us down an interesting
> > but
> > > >> ultimately more distracting road, it's worth
> > > >> pointing out that it is clearly a major win to have as few
> > > >> repartition topics/steps as possible. Given that we
> > > >> don't want to change existing behavior, the optimization framework
> > can only
> > > >> help out when the placement of
> > > >> repartition steps is flexible, which means only those from .groupBy
> > (and
> > > >> not .repartition). *Users should not*
> > > >> *have to choose between allowing Streams to optimize the repartition
> > > >> placement, and allowing to specify a *
> > > >> *number of partitions.*
> > > >>
> > > >> Lastly, I have what may be a stupid question but for my own
> > edification of
> > > >> how groupBy works:
> > > >> if you do a .groupBy and a repartition is NOT required, does it ever
> > need
> > > >> to serialize/deserialize
> > > >> any of the data? In other words, if you pass a key/value serde to
> > groupBy
> > > >> and it doesn't trigger
> > > >> a repartition, is the serde(s) just ignored and thus more like a
> > suggestion
> > > >> than a requirement?
> > > >>
> > > >> So again, I don't want to hold up this KIP forever but I feel we've
> > spent
> > > >> some time getting slightly
> > > >> off track (although certainly into very interesting discussions) yet
> > never
> > > >> really addressed or questioned
> > > >> the basic premise: *could a user want to specify a number of
> > partitions but
> > > >> not enforce a repartition (at that*
> > > >> *specific point in the topology)?*
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <
> > matthias@confluent.io>
> > > >> wrote:
> > > >>
> > > >>> Side remark:
> > > >>>
> > > >>> If the user specifies `repartition()` on both side of the join, we
> > can
> > > >>> actually throw the execption earlier, ie, when we build the topology.
> > > >>>
> > > >>> Current, we can do this check only after Kafka Streams was started,
> > > >>> within `StreamPartitionAssignor#assign()` -- we still need to keep
> > this
> > > >>> check for the case that none or only one side has a user specified
> > > >>> number of partitions though.
> > > >>>
> > > >>>
> > > >>> -Matthias
> > > >>>
> > > >>> On 11/14/19 8:15 AM, John Roesler wrote:
> > > >>>> Thanks, all,
> > > >>>>
> > > >>>> I can get behind just totally leaving out reparation-via-groupBy. If
> > > >>>> we only introduce `repartition()` for now, we're making the minimal
> > > >>>> change to gain the desired capability.
> > > >>>>
> > > >>>> Plus, since we agree that `repartition()` should never be
> > optimizable,
> > > >>>> it's a future-compatible proposal. I.e., if we were to add a
> > > >>>> non-optimizable groupBy(partitions) operation now, and want to make
> > it
> > > >>>> optimizable in the future, we have to worry about topology
> > > >>>> compatibility. Better to just do non-optimizable `repartition()`
> > now,
> > > >>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
> > > >>>>
> > > >>>> About joins, yes, it's a concern, and IMO we should just do the same
> > > >>>> thing we do now... check at runtime that the partition counts on
> > both
> > > >>>> sides match and throw an exception otherwise. What this means as a
> > > >>>> user is that if you explicitly repartition the left side to 100
> > > >>>> partitions, and then join with the right side at 10 partitions, you
> > > >>>> get an exception, since this operation is not possible. You'd either
> > > >>>> have to "step down" the left side again, back to 10 partitions, or
> > you
> > > >>>> could repartition the right side to 100 partitions before the join.
> > > >>>> The choice has to be the user's, since it depends on their desired
> > > >>>> execution parallelism.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> -John
> > > >>>>
> > > >>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <
> > matthias@confluent.io>
> > > >>> wrote:
> > > >>>>>
> > > >>>>> Thanks a lot John. I think the way you decompose the operators is
> > super
> > > >>>>> helpful for this discussion.
> > > >>>>>
> > > >>>>> What you suggest with regard to using `Grouped` and enforcing
> > > >>>>> repartitioning if the number of partitions is specified is
> > certainly
> > > >>>>> possible. However, I am not sure if we _should_ do this. My
> > reasoning is
> > > >>>>> that an enforce repartitioning as introduced via `repartition()`
> > is an
> > > >>>>> expensive operations, and it seems better to demand an more
> > explicit
> > > >>>>> user opt-in to trigger it. Just setting an optional parameter
> > might be
> > > >>>>> too subtle to trigger such a heavy "side effect".
> > > >>>>>
> > > >>>>> While I agree about "usability" in general, I would prefer a more
> > > >>>>> conservative appraoch to introduce this feature, see how it goes,
> > and
> > > >>>>> maybe make it more advance later on. This also applies to what
> > > >>>>> optimzation we may or may not allow (or are able to perform at
> > all).
> > > >>>>>
> > > >>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
> > > >>>>> Grouped`, I agree that it might not be a good idea.
> > > >>>>>
> > > >>>>> Atm, I see an enforces repartitioning as non-optimizable and as a
> > good
> > > >>>>> first step and I would suggest to not intoruce anything else for
> > now.
> > > >>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
> > > >>>>> Grouped)` is something we could add later.
> > > >>>>>
> > > >>>>>
> > > >>>>> Therefore, I would not change `Grouped` but only introduce
> > > >>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in
> > to
> > > >>>>> set the number of partitions, would need to rewrite their code to
> > > >>>>> `selectKey(...).repartition(...).groupByKey()`. It's less
> > convinient but
> > > >>>>> also less risky from an API and optimization point of view.
> > > >>>>>
> > > >>>>>
> > > >>>>> @Levani: about joins -> yes, we will need to check the specified
> > number
> > > >>>>> of partitions (if any) and if they don't match, throw an
> > exception. We
> > > >>>>> can discuss this on the PR -- I am just trying to get the PR for
> > KIP-466
> > > >>>>> merged -- your is next on the list :)
> > > >>>>>
> > > >>>>>
> > > >>>>> Thoughts?
> > > >>>>>
> > > >>>>>
> > > >>>>> -Matthias
> > > >>>>>
> > > >>>>>
> > > >>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
> > > >>>>>> Thank you all for an interesting discussion. This is very
> > enlightening.
> > > >>>>>>
> > > >>>>>> Thank you Matthias for your explanation. Your arguments are very
> > true.
> > > >>> It makes sense that if user specifies number of partitions he/she
> > really
> > > >>> cares that those specifications are applied to internal topics.
> > > >>>>>> Unfortunately, in current implementation this is not true during
> > > >>> `join` operation. As I’ve written in the PR comment, currently, when
> > > >>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max
> > number of
> > > >>> partitions from the two source topics.
> > > >>>>>> I’m not really sure what would be the other way around this
> > situation.
> > > >>> Maybe fail the stream altogether and inform the user to specify same
> > number
> > > >>> of partitions?
> > > >>>>>> Or we should treat join operations in a same way as it is right
> > now
> > > >>> and basically choose max number of partitions even when `repartition`
> > > >>> operation is specified, because Kafka Streams “knows the best” how to
> > > >>> handle joins?
> > > >>>>>> You can check integration tests how it’s being handled currently.
> > Open
> > > >>> to suggestions on that part.
> > > >>>>>>
> > > >>>>>> As for groupBy, I agree and John raised very interesting points.
> > My
> > > >>> arguments for allowing users to specify number of partitions during
> > groupBy
> > > >>> operations mainly was coming from the usability perspective.
> > > >>>>>> So building on top of what John said, maybe it makes sense to make
> > > >>> `groupBy` operations smarter and whenever user specifies
> > > >>> `numberOfPartitions` configuration, repartitioning will be enforced,
> > wdyt?
> > > >>>>>> I’m not going into optimization part yet :) I think it will be
> > part of
> > > >>> separate PR and task, but overall it makes sense to apply
> > optimizations
> > > >>> where number of partitions are the same.
> > > >>>>>>
> > > >>>>>> As for Repartitioned extending Grouped, I kinda feel that it
> > won’t fit
> > > >>> nicely in current API design.
> > > >>>>>> In addition, in the PR review, John mentioned that there were a
> > lot of
> > > >>> troubles in the past trying to use one operation's configuration
> > objects on
> > > >>> other operations.
> > > >>>>>> Also it makes sense to keep them separate in terms of
> > compatibility.
> > > >>>>>> In that case, we don’t have to worry every time Grouped is
> > changed,
> > > >>> what would be the implications on `repartition` operations.
> > > >>>>>>
> > > >>>>>> Kind regards,
> > > >>>>>> Levani
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io>
> > wrote:
> > > >>>>>>>
> > > >>>>>>> Ah, thanks for the clarification. I missed your point.
> > > >>>>>>>
> > > >>>>>>> I like the framework you've presented. It does seem simpler to
> > assume
> > > >>>>>>> that they either care about the partition count and want to
> > > >>>>>>> repartition to realize it, or they don't care about the number.
> > > >>>>>>> Returning to this discussion, it does seem unlikely that they
> > care
> > > >>>>>>> about the number and _don't_ care if it actually gets realized.
> > > >>>>>>>
> > > >>>>>>> But then, it still seems like we can just keep the option as
> > part of
> > > >>>>>>> Grouped. As in:
> > > >>>>>>>
> > > >>>>>>> // user does not care
> > > >>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
> > > >>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
> > > >>>>>>>
> > > >>>>>>> // user does care
> > > >>>>>>> stream.repartition(Repartitioned)
> > > >>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
> > > >>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
> > > >>>>>>>
> > > >>>>>>> ----
> > > >>>>>>>
> > > >>>>>>> The above discussion got me thinking about algebra. Matthias is
> > > >>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent
> > to
> > > >>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not
> > convinced that
> > > >>>>>>> we should force people to apply that expansion themselves vs.
> > having a
> > > >>>>>>> more compact way to express it if they don't care where exactly
> > the
> > > >>>>>>> repartition occurs. However, thinking about these operators
> > > >>>>>>> algebraically can really help *us* narrow down the number of
> > different
> > > >>>>>>> expressions we have to consider.
> > > >>>>>>>
> > > >>>>>>> Let's consider some identities:
> > > >>>>>>>
> > > >>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
> > > >>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
> > > >>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
> > > >>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
> > > >>>>>>> D: repartition = sink(managed) + src
> > > >>>>>>>
> > > >>>>>>> In these identities, I used one special identifier (...), which
> > means
> > > >>>>>>> any number (0+) of operations that are not src, mapKey,
> > groupBy[Key],
> > > >>>>>>> repartition, or agg.
> > > >>>>>>>
> > > >>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
> > > >>>>>>> operations are not executable. In other words, you have to get
> > to a
> > > >>>>>>> point where you can apply B to convert a groupByKey into a
> > passthough
> > > >>>>>>> in order to execute the program. This is just a formal way of
> > stating
> > > >>>>>>> what already happens in Kafka Streams.
> > > >>>>>>>
> > > >>>>>>> By applying A, we can just completely leave `groupBy` out of our
> > > >>>>>>> analysis. It trivially decomposes into a mapKey followed by a
> > > >>>>>>> groupByKey.
> > > >>>>>>>
> > > >>>>>>> Then, we can eliminate the "repartition required" case of
> > `groupByKey`
> > > >>>>>>> by applying C followed by D to get to the "no repartition
> > required"
> > > >>>>>>> version of groupByKey, which in turn sets us up to apply B to
> > get an
> > > >>>>>>> executable topology.
> > > >>>>>>>
> > > >>>>>>> Fundamentally, you can think about KIP-221 is as proposing a
> > modified
> > > >>>>>>> D identity in which you can specify the partition count of the
> > managed
> > > >>>>>>> sink topic:
> > > >>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
> > > >>>>>>>
> > > >>>>>>> Since users _could_ apply the identities above, we don't
> > actually have
> > > >>>>>>> to add any partition count to groupBy[Key], but we decided early
> > on in
> > > >>>>>>> the KIP discussion that it's more ergonomic to add it. In that
> > case,
> > > >>>>>>> we also have to modify A and C:
> > > >>>>>>> A': groupBy(mapper, pc) + agg
> > > >>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
> > > >>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
> > > >>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
> > > >>>>>>>
> > > >>>>>>> Which sets us up still to always be able to get back to a plain
> > > >>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
> > > >>>>>>> ultimately B to get an executable topology.
> > > >>>>>>>
> > > >>>>>>> What about the optimizer?
> > > >>>>>>> The optimizer applies another set of graph-algebraic identities
> > to
> > > >>>>>>> minimize the number of repartition topics in a topology.
> > > >>>>>>>
> > > >>>>>>> (forgive my ascii art)
> > > >>>>>>>
> > > >>>>>>> E: (merging repartition nodes)
> > > >>>>>>> (...) -> repartition -> X
> > > >>>>>>> \-> repartition -> Y
> > > >>>>>>> =
> > > >>>>>>> (... + repartition) -> X
> > > >>>>>>>    \-> Y
> > > >>>>>>> F: (reordering around repartition)
> > > >>>>>>> Where SVO is any non-key-changing, stateless, operation:
> > > >>>>>>> repartition -> SVO = SVO -> repartition
> > > >>>>>>>
> > > >>>>>>> In terms of these identities, what the optimizer does is apply F
> > > >>>>>>> repeatedly in either direction to a topology to factor out
> > common in
> > > >>>>>>> branches so that it can apply E to merge repartition nodes. This
> > was
> > > >>>>>>> especially necessary before KIP-221 because you couldn't directly
> > > >>>>>>> express `repartition` in the DSL, only indirectly via
> > `groupBy[Key]`,
> > > >>>>>>> so there was no way to do the factoring manually.
> > > >>>>>>>
> > > >>>>>>> We can now state very clearly that in KIP-221, explicit
> > > >>>>>>> `repartition()` operators should create a "reordering barrier".
> > So, F
> > > >>>>>>> cannot be applied to an explicit `repartition()`. Also, I think
> > we
> > > >>>>>>> decided earlier that explicit `repartition()` operations would
> > also be
> > > >>>>>>> ineligible for merging, so E can't be applied to explicit
> > > >>>>>>> `repartition()` operations either. I think we feel we _could_
> > apply E
> > > >>>>>>> without harm, but we want to be conservative for now.
> > > >>>>>>>
> > > >>>>>>> I think the salient point from the latter discussion has been
> > that
> > > >>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_
> > constitute
> > > >>>>>>> an explicit `repartition()` operator, and therefore, the
> > resulting
> > > >>>>>>> repartition node remains eligible for optimization.
> > > >>>>>>>
> > > >>>>>>> To be clear, I agree with Matthias that the provided partition
> > count
> > > >>>>>>> _must_ be used in the resulting implicit repartition. This has
> > some
> > > >>>>>>> implications for E. Namely, E could only be applied to two
> > repartition
> > > >>>>>>> nodes that have the same partition count. This has always been
> > > >>>>>>> trivially true before KIP-221 because the partition count has
> > always
> > > >>>>>>> been "unspecified", i.e., it would be determined at runtime by
> > the
> > > >>>>>>> user-managed-topics' partition counts. Now, it could be
> > specified or
> > > >>>>>>> unspecified. We can simply augment E to allow merging only
> > repartition
> > > >>>>>>> nodes where the partition count is EITHER "specified and the
> > same on
> > > >>>>>>> both sides", OR "unspecified on both sides".
> > > >>>>>>>
> > > >>>>>>> Sorry for the long email, but I have a hope that it builds a
> > solid
> > > >>>>>>> theoretical foundation for our decisions in KIP-221, so we can
> > have
> > > >>>>>>> confidence that there are no edge cases for design flaws to hide.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> -John
> > > >>>>>>>
> > > >>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
> > > >>> matthias@confluent.io> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> it seems like we do want to allow
> > > >>>>>>>>>> people to optionally specify a partition count as part of this
> > > >>>>>>>>>> operation, but we don't want that option to _force_
> > repartitioning
> > > >>>>>>>>
> > > >>>>>>>> Correct, ie, that is my suggestions.
> > > >>>>>>>>
> > > >>>>>>>>> "Use P partitions if repartitioning is necessary"
> > > >>>>>>>>
> > > >>>>>>>> I disagree here, because my reasoning is that:
> > > >>>>>>>>
> > > >>>>>>>> - if a user cares about the number of partition, the user wants
> > to
> > > >>>>>>>> enforce a repartitioning
> > > >>>>>>>> - if a user does not case about the number of partitions, we
> > don't
> > > >>> need
> > > >>>>>>>> to provide them a way to pass in a "hint"
> > > >>>>>>>>
> > > >>>>>>>> Hence, it should be sufficient to support:
> > > >>>>>>>>
> > > >>>>>>>> // user does not care
> > > >>>>>>>>
> > > >>>>>>>> `stream.groupByKey(Grouped)`
> > > >>>>>>>> `stream.grouBy(..., Grouped)`
> > > >>>>>>>>
> > > >>>>>>>> // user does care
> > > >>>>>>>>
> > > >>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
> > > >>>>>>>> `streams.groupBy(..., Repartitioned)`
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> -Matthias
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
> > > >>>>>>>>> Thanks for those thoughts, Matthias,
> > > >>>>>>>>>
> > > >>>>>>>>> I find your reasoning about the optimization behavior
> > compelling.
> > > >>> The
> > > >>>>>>>>> `through` operation is very simple and clear to reason about.
> > It
> > > >>> just
> > > >>>>>>>>> passes the data exactly at the specified point in the topology
> > > >>> exactly
> > > >>>>>>>>> through the specified topic. Likewise, if a user invokes a
> > > >>>>>>>>> `repartition` operator, the simplest behavior is if we just do
> > what
> > > >>>>>>>>> they asked for.
> > > >>>>>>>>>
> > > >>>>>>>>> Stepping back to think about when optimizations are surprising
> > and
> > > >>>>>>>>> when they aren't, it occurs to me that we should be free to
> > move
> > > >>>>>>>>> around repartitions when users have asked to perform some
> > operation
> > > >>>>>>>>> that implies a repartition, like "change keys, then filter,
> > then
> > > >>>>>>>>> aggregate". This program requires a repartition, but it could
> > be
> > > >>>>>>>>> anywhere between the key change and the aggregation. On the
> > other
> > > >>>>>>>>> hand, if they say, "change keys, then filter, then
> > repartition, then
> > > >>>>>>>>> aggregate", it seems like they were pretty clear about their
> > desire,
> > > >>>>>>>>> and we should just take it at face value.
> > > >>>>>>>>>
> > > >>>>>>>>> So, I'm sold on just literally doing a repartition every time
> > they
> > > >>>>>>>>> invoke the `repartition` operator.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is
> > more
> > > >>> nuanced.
> > > >>>>>>>>>
> > > >>>>>>>>> What you said about `groupByKey` makes sense. If they key
> > hasn't
> > > >>>>>>>>> actually changed, then we don't need to repartition before
> > > >>>>>>>>> aggregating. On the other hand, `groupBy` is specifically
> > changing
> > > >>> the
> > > >>>>>>>>> key as part of the grouping operation, so (as you said) we
> > > >>> definitely
> > > >>>>>>>>> have to do a repartition.
> > > >>>>>>>>>
> > > >>>>>>>>> If I'm reading the discussion right, it seems like we do want
> > to
> > > >>> allow
> > > >>>>>>>>> people to optionally specify a partition count as part of this
> > > >>>>>>>>> operation, but we don't want that option to _force_
> > repartitioning
> > > >>> if
> > > >>>>>>>>> it's not needed. That last clause is the key. "Use P
> > partitions if
> > > >>>>>>>>> repartitioning is necessary" is a directive that applies
> > cleanly and
> > > >>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call
> > the
> > > >>>>>>>>> option `numberOfPartitionsHint`, which along with the "if
> > necessary"
> > > >>>>>>>>> javadoc, should make it clear that the option won't force a
> > > >>>>>>>>> repartition, and also gives us enough latitude to still employ
> > the
> > > >>>>>>>>> optimizer on those repartition topics?
> > > >>>>>>>>>
> > > >>>>>>>>> If we like the idea of expressing it as a "hint" for grouping
> > and a
> > > >>>>>>>>> "command" for `repartition`, then it seems like it still makes
> > sense
> > > >>>>>>>>> to keep Grouped and Repartitioned separate, as they would
> > actually
> > > >>>>>>>>> offer different methods with distinct semantics.
> > > >>>>>>>>>
> > > >>>>>>>>> WDYT?
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>> -John
> > > >>>>>>>>>
> > > >>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
> > > >>> matthias@confluent.io> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Sorry for late reply.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I guess, the question boils down to the intended semantics of
> > > >>>>>>>>>> `repartition()`. My understanding is as follows:
> > > >>>>>>>>>>
> > > >>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using
> > the
> > > >>>>>>>>>> upstream topic to determine the number of partitions)
> > > >>>>>>>>>> - KS does auto-repartitioning only for downstream DSL
> > operators
> > > >>> like
> > > >>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
> > > >>> auto-repartitioning
> > > >>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
> > > >>>>>>>>>> - KS offers `through()` to enforce a repartitioning --
> > however,
> > > >>> the user
> > > >>>>>>>>>> needs to create the topic manually (with the desired number of
> > > >>> partitions).
> > > >>>>>>>>>>
> > > >>>>>>>>>> I see two main applications for `repartitioning()`:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1) repartition data before a `transform()` but user does not
> > want
> > > >>> to
> > > >>>>>>>>>> manage the topic
> > > >>>>>>>>>> 2) scale out a downstream subtopology
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a
> > users
> > > >>> calls
> > > >>>>>>>>>> it, a repartitining is enforced, with the difference that KS
> > > >>> manages the
> > > >>>>>>>>>> topic and the user does not need to create it.
> > > >>>>>>>>>>
> > > >>>>>>>>>> This behavior makes (1) and (2) possible.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I think many users would prefer to just say "if there *is* a
> > > >>> repartition
> > > >>>>>>>>>>> required at this point in the topology, it should
> > > >>>>>>>>>>> have N partitions"
> > > >>>>>>>>>>
> > > >>>>>>>>>> Because of (2), I disagree. Either a user does not care about
> > > >>> scaling
> > > >>>>>>>>>> out, for which case she would not specify the number of
> > > >>> partitions. Or a
> > > >>>>>>>>>> user does care, and hence wants to enforce the scale out. I
> > don't
> > > >>> think
> > > >>>>>>>>>> that any user would say, "maybe scale out".
> > > >>>>>>>>>>
> > > >>>>>>>>>> Therefore, the optimizer should never ignore the repartition
> > > >>> operation.
> > > >>>>>>>>>> As a "consequence" (because repartitioning is expensive) a
> > user
> > > >>> should
> > > >>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking
> > an
> > > >>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be
> > "dangerous"
> > > >>>>>>>>>> because it might be too subtle and an "optional scaling out"
> > as
> > > >>> laid out
> > > >>>>>>>>>> above does not make sense IMHO.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I am also not worried about "over repartitioning" because the
> > > >>> result
> > > >>>>>>>>>> stream would never trigger auto-repartitioning. Only if
> > multiple
> > > >>>>>>>>>> consecutive calls to `repartition()` are made it could be bad
> > --
> > > >>> but
> > > >>>>>>>>>> that's the same with `through()`. In the end, there is always
> > some
> > > >>>>>>>>>> responsibility on the user.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be
> > required,
> > > >>>>>>>>>> however, for `groupByKey()` it depends if the KStream is
> > marked as
> > > >>>>>>>>>> `repartitioningRequired`.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a
> > user to
> > > >>> set
> > > >>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different
> > story,
> > > >>>>>>>>>> because calling
> > > >>>>>>>>>>
> > > >>>>>>>>>>  `repartition().groupBy()`
> > > >>>>>>>>>>
> > > >>>>>>>>>> does not achieve what we want. Hence, allowing users to pass
> > in the
> > > >>>>>>>>>> number of users partitions into `groupBy()` does actually
> > makes
> > > >>> sense,
> > > >>>>>>>>>> because repartitioning will happen anyway and thus we can
> > > >>> piggyback a
> > > >>>>>>>>>> scaling decision.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think that John has a fair concern about the overloads,
> > however,
> > > >>> I am
> > > >>>>>>>>>> not convinced that using `Grouped` to specify the number of
> > > >>> partitions
> > > >>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned`
> > and
> > > >>> both
> > > >>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I
> > am
> > > >>>>>>>>>> wondering if we could bridge the gap between both, if we
> > would make
> > > >>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
> > > >>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what
> > seems to
> > > >>> make
> > > >>>>>>>>>> the API quite smooth:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `stream.groupBy(..., Grouped...)`
> > > >>>>>>>>>>
> > > >>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thoughts?
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> -Matthias
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> > > >>>>>>>>>>> Hi Sophie,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
> > > >>> hearing others opinion as well on this.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Kind regards,
> > > >>>>>>>>>>> Levani
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
> > > >>> sophie@confluent.io> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> > > >>> other hand
> > > >>>>>>>>>>>> Kafka Streams has already
> > > >>>>>>>>>>>>> optimizer in place which alters topology independently
> > from user
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I agree (with you) and think this is a good way to put it
> > -- we
> > > >>> currently
> > > >>>>>>>>>>>> auto-repartition for the user so
> > > >>>>>>>>>>>> that they don't have to walk through their entire topology
> > and
> > > >>> reason about
> > > >>>>>>>>>>>> when and where to place a
> > > >>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly
> > force
> > > >>> this onto the
> > > >>>>>>>>>>>> user? How certain are we that
> > > >>>>>>>>>>>> users will always get this right? It's easy to imagine that
> > > >>> during
> > > >>>>>>>>>>>> development, you write your new app with
> > > >>>>>>>>>>>> correctly placed repartitions in order to use this new
> > feature.
> > > >>> During the
> > > >>>>>>>>>>>> course of development you end up
> > > >>>>>>>>>>>> tweaking the topology, but don't remember to review or move
> > the
> > > >>>>>>>>>>>> repartitioning since you're used to Streams
> > > >>>>>>>>>>>> doing this for you. If you use only single-partition topics
> > for
> > > >>> testing,
> > > >>>>>>>>>>>> you might not even notice your app is
> > > >>>>>>>>>>>> spitting out incorrect results!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
> > > >>> introduce a new
> > > >>>>>>>>>>>> feature and say that to use it, you can't take
> > > >>>>>>>>>>>> advantage of this other feature anymore. Also, is it
> > possible our
> > > >>>>>>>>>>>> optimization framework could ever include an
> > > >>>>>>>>>>>> optimized repartitioning strategy that is better than what a
> > > >>> user could
> > > >>>>>>>>>>>> achieve by manually inserting repartitions?
> > > >>>>>>>>>>>> Do we expect users to have a deep understanding of the best
> > way
> > > >>> to
> > > >>>>>>>>>>>> repartition their particular topology, or is it
> > > >>>>>>>>>>>> likely they will end up over-repartitioning either due to
> > missed
> > > >>>>>>>>>>>> optimizations or unnecessary extra repartitions?
> > > >>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
> > > >>> repartition
> > > >>>>>>>>>>>> required at this point in the topology, it should
> > > >>>>>>>>>>>> have N partitions"
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped
> > rather
> > > >>> than
> > > >>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in
> > line
> > > >>> with the
> > > >>>>>>>>>>>> current syntax so +1 from me
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
> > > >>> levani.codes@gmail.com>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hello all,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under
> > review and
> > > >>> it’s
> > > >>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP
> > to
> > > >>> address
> > > >>>>>>>>>>>>> couple of concerns raised by Matthias and John.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
> > > >>> control over
> > > >>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
> > > >>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is
> > done
> > > >>> in
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
> > > >>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which
> > is
> > > >>> planned to
> > > >>>>>>>>>>>>> be separate PR.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> While all agree about general implementation and idea
> > behind
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR,
> > introducing new
> > > >>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
> > > >>> questions during
> > > >>>>>>>>>>>>> the review.
> > > >>>>>>>>>>>>> Matthias raised concern that there can be cases when user
> > uses
> > > >>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
> > > >>> repartitioning may
> > > >>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
> > > >>> would never be
> > > >>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted
> > your
> > > >>> comment).
> > > >>>>>>>>>>>>> So instead, if user wants to control parallelism of
> > > >>> sub-topologies, he or
> > > >>>>>>>>>>>>> she should always use `KStream#repartition` operation
> > before
> > > >>> groupBy. Full
> > > >>>>>>>>>>>>> comment can be seen here:
> > > >>>>>>>>>>>>>
> > > >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> > > >>>>>>>>>>>>>
> > > >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On the same topic, John pointed out that, from API design
> > > >>> perspective, we
> > > >>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
> > > >>> operators between
> > > >>>>>>>>>>>>> one another. So instead of introducing new
> > > >>> `KStream#groupBy(Repartitioned)`
> > > >>>>>>>>>>>>> for specifying number of partitions for internal topic, we
> > > >>> should update
> > > >>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> > > >>> other hand
> > > >>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
> > > >>> topology
> > > >>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
> > > >>> Streams,
> > > >>>>>>>>>>>>> internally would optimize topology in the best way
> > possible,
> > > >>> even if in
> > > >>>>>>>>>>>>> some cases this means ignoring some operator configurations
> > > >>> passed by the
> > > >>>>>>>>>>>>> user. Also, I agree with John about API design semantics.
> > If we
> > > >>> go through
> > > >>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes
> > more
> > > >>> sense to
> > > >>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead
> > of
> > > >>> introducing
> > > >>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I would really appreciate communities feedback on this.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Kind regards,
> > > >>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
> > > >>> sophie@confluent.io>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hey Levani,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
> > > >>> don't have
> > > >>>>>>>>>>>>> much
> > > >>>>>>>>>>>>>> spare time at the
> > > >>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
> > > >>> things, but
> > > >>>>>>>>>>>>> feel
> > > >>>>>>>>>>>>>> free to pick up something else
> > > >>>>>>>>>>>>>> to work on in the meantime until things have calmed down
> > a bit!
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Cheers,
> > > >>>>>>>>>>>>>> Sophie
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> > > >>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Hello all,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like
> > to get
> > > >>> some
> > > >>>>>>>>>>>>>>> attention on this PR:
> > > >>> https://github.com/apache/kafka/pull/7170 <
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
> > > >>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> > > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
> > > >>>>>>>>>>>>>>> It's been a while now and I would love to move on to
> > other
> > > >>> KIPs as well.
> > > >>>>>>>>>>>>>>> Please let me know if you have any concerns.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
> > > >>>>>>>>>>>>>>>
> > > >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> > > >>>>>>>>>>>>>
> > https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
> > > >>> <
> > > >>>>>>>>>>>>>>>
> > > >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> > > >>>>>>>>>>>>>
> > https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
> > > >>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>>>
> > > >>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi Matthias,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion
> > on
> > > >>> that one.
> > > >>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a
> > good
> > > >>> idea.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Updated KIP
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
> > > >>> matthias@confluent.io
> > > >>>>>>>>>>>>> <ma...@confluent.io>
> > > >>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
> > matthias@confluent.io>>>
> > > >>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> One question:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Why do we add
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
> > > >>> numberOfPartitions)
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
> > > >>> `#numberOfPartitions(int)` in
> > > >>>>>>>>>>>>>>>>>> combination with `withName()` and
> > > >>> `withNumberOfPartitions()` should
> > > >>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
> > > >>> small if
> > > >>>>>>>>>>>>>>> possible.)
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Otherwise LGTM.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> -Matthias
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> > > >>>>>>>>>>>>>>>>>>> Hello,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
> > > >>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re
> > any
> > > >>> other
> > > >>>>>>>>>>>>>>> concerns about this KIP, please let me know.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> > > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>
> > wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi Matthias,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
> > > >>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> ).
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> > > >>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
> > > >>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
> > matthias@confluent.io>>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
> > <mailto:
> > > >>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
> > > >>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
> > > >>> partitioning
> > > >>>>>>>>>>>>>>> strategy.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
> > > >>> producer
> > > >>>>>>>>>>>>>>> configs. My
> > > >>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
> > > >>> "extension" to
> > > >>>>>>>>>>>>>>> `Produced`,
> > > >>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to
> > think
> > > >>> about it and
> > > >>>>>>>>>>>>>>> helps
> > > >>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to
> > avoid
> > > >>>>>>>>>>>>> abbreviations.
> > > >>>>>>>>>>>>>>> Can
> > > >>>>>>>>>>>>>>>>>>>>> we rename:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more
> > `static`
> > > >>> methods:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> > > >>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> > > >>>>>>>>>>>>>>>>>>>>> -
> > Repartitioned.streamPartitioner(StreamPartitioner)
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> -Matthias
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> > > >>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it
> > makes
> > > >>> sense to
> > > >>>>>>>>>>>>> have
> > > >>>>>>>>>>>>>>> some duplicate configurations between operators in order
> > to
> > > >>> keep API
> > > >>>>>>>>>>>>> simple
> > > >>>>>>>>>>>>>>> and usable.
> > > >>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to
> > have
> > > >>> proper
> > > >>>>>>>>>>>>>>> backward compatibility.
> > > >>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
> > > >>> separate was
> > > >>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
> > > >>> producer level
> > > >>>>>>>>>>>>>>> configs makes API more readable.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman
> > <
> > > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> > <mailto:
> > > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> > > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> > <mailto:
> > > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
> > > >>> producer level
> > > >>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
> > > >>> considerations
> > > >>>>>>>>>>>>>>> separate.
> > > >>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
> > > >>> level
> > > >>>>>>>>>>>>>>> configuration. But
> > > >>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just
> > as
> > > >>> much a topic
> > > >>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have
> > two
> > > >>> producers
> > > >>>>>>>>>>>>>>> configured
> > > >>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
> > > >>> they are
> > > >>>>>>>>>>>>>>> writing to the
> > > >>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
> > > >>> part. So in a
> > > >>>>>>>>>>>>>>> sense, these
> > > >>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
> > > >>> producers.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams
> > API is
> > > >>> not always
> > > >>>>>>>>>>>>>>> true to
> > > >>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
> > > >>> operator are
> > > >>>>>>>>>>>>>>> wrapped into a
> > > >>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We
> > could
> > > >>> instead
> > > >>>>>>>>>>>>>>> split out the
> > > >>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
> > > >>> parameter to keep
> > > >>>>>>>>>>>>>>> topic and
> > > >>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
> > > >>> increases the
> > > >>>>>>>>>>>>>>> API surface
> > > >>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to
> > just
> > > >>> say "this
> > > >>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
> > > >>> without worrying
> > > >>>>>>>>>>>>>>> about what
> > > >>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
> > > >>> field of
> > > >>>>>>>>>>>>>>> Repartitioned,
> > > >>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
> > > >>> elsewhere in
> > > >>>>>>>>>>>>>>> Streams at
> > > >>>>>>>>>>>>>>>>>>>>>>> the moment
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani
> > Kokhreidze <
> > > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> > > >>> <mailto:
> > > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
> > > >>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> > > >>>>>>>>>>>>>>> `Repartitioned#partitioner`
> > > >>>>>>>>>>>>>>>>>>>>>>>> configuration.
> > > >>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class
> > for
> > > >>> topic level
> > > >>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer
> > level
> > > >>>>>>>>>>>>>>> configurations (such
> > > >>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in
> > this
> > > >>> thread).
> > > >>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
> > > >>> interface, I
> > > >>>>>>>>>>>>>>> couldn’t really
> > > >>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
> > > >>> configuration
> > > >>>>>>>>>>>>>>> class and just
> > > >>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda
> > breaking
> > > >>> the
> > > >>>>>>>>>>>>>>> semantics.
> > > >>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
> > > >>> encapsulates topic
> > > >>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal
> > topics is
> > > >>> viable
> > > >>>>>>>>>>>>>>> thing to do.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
> > > >>> bbejeck@gmail.com
> > > >>>>>>>>>>>>> <ma...@gmail.com>
> > > >>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> > > >>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.
> > In
> > > >>> addition to
> > > >>>>>>>>>>>>>>> the reason
> > > >>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
> > > >>> hash-based
> > > >>>>>>>>>>>>>>> partitioning
> > > >>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
> > > >>> different than the
> > > >>>>>>>>>>>>>>> key
> > > >>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
> > > >>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
> > > >>> conjunction with
> > > >>>>>>>>>>>>>>> the key
> > > >>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
> > > >>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
> > > >>> Produced hence
> > > >>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes
> > sense
> > > >>> for API
> > > >>>>>>>>>>>>>>> consistency.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Bill
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani
> > Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > > >>> levani.codes@gmail.com>
> > > >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>>
> > > >>> <mailto:
> > > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
> > > >>> Repartitioned class,
> > > >>>>>>>>>>>>>>> with the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of
> > partitions for
> > > >>> internal
> > > >>>>>>>>>>>>>>> topics, user
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure
> > co-partitioning
> > > >>> before join
> > > >>>>>>>>>>>>>>> operation.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> > > >>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
> > > >>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> > > >>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind.
> > Again,
> > > >>> not totally
> > > >>>>>>>>>>>>>>> sure it
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar
> > is
> > > >>> the
> > > >>>>>>>>>>>>> rationale
> > > >>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> -John
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani
> > Kokhreidze
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> > > >>> levani.codes@gmail.com>
> > > >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>>
> > > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > > >>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your
> > example
> > > >>> I would
> > > >>>>>>>>>>>>>>> first issue
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper
> > partitioner
> > > >>> that
> > > >>>>>>>>>>>>>>> essentially
> > > >>>>>>>>>>>>>>>>>>>>>>>> would
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with
> > and
> > > >>> then do the
> > > >>>>>>>>>>>>>>>>>>>>>>>> KStream#join
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler
> > <
> > > >>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> > > >>> john@confluent.io
> > > >>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io
> > <mailto:
> > > >>>>>>>>>>>>> john@confluent.io>
> > > >>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an
> > option to
> > > >>> specify
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is
> > that
> > > >>> some data may
> > > >>>>>>>>>>>>>>> get
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
> > > >>> topic. If the
> > > >>>>>>>>>>>>>>> right-side
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
> > > >>> strategy, then the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
> > > >>> partitioned with the
> > > >>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
> > > >>> something
> > > >>>>>>>>>>>>>>> important?
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
> > > >>> Kokhreidze
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> > > >>> levani.codes@gmail.com>
> > > >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>>
> > > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > > >>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To
> > be
> > > >>> honest I’m
> > > >>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>> sure
> > > >>>>>>>>>>>>>>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
> > > >>> think I would
> > > >>>>>>>>>>>>>>> need control
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka
> > Streams
> > > >>> knows best
> > > >>>>>>>>>>>>>>> how to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should
> > be
> > > >>> used only for
> > > >>>>>>>>>>>>>>> topics
> > > >>>>>>>>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
> > > >>> possibility to
> > > >>>>>>>>>>>>>>> specify
> > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet,
> > but I
> > > >>> guess
> > > >>>>>>>>>>>>>>> specifying the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
> > > >>> agreement on
> > > >>>>>>>>>>>>> this.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
> > > >>> Blee-Goldman <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
> > sophie@confluent.io>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> > > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> > wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
> > > >>> Repartitioned
> > > >>>>>>>>>>>>>>> would be a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> useful
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also
> > need
> > > >>> to have
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field,
> > similar
> > > >>> to
> > > >>>>>>>>>>>>>>> Produced? I'm not
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> sure how
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but
> > seems
> > > >>> it should
> > > >>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>>>>> available
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
> > > >>> Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > > >>> levani.codes@gmail.com>
> > > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by
> > Kafka
> > > >>> Streams.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user
> > more
> > > >>> control over
> > > >>>>>>>>>>>>>>> the topic
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> such as
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
> > > >>> something that
> > > >>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>> missing
> > > >>>>>>>>>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
> > > >>> parallelism by
> > > >>>>>>>>>>>>>>> configuring
> > > >>>>>>>>>>>>>>>>>>>>>>>> num
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
> > > >>> Blee-Goldman <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
> > sophie@confluent.io>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> > > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
> > > >>> thing for me
> > > >>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>>> for the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
> > > >>> Repartitioned,
> > > >>>>>>>>>>>>>>> will the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> topic be
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems
> > to be
> > > >>> the case
> > > >>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> signature
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it
> > have to
> > > >>> be
> > > >>>>>>>>>>>>>>> pre-created? The
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> wording
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version
> > of
> > > >>> the method
> > > >>>>>>>>>>>>>>> will
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
> > > >>> Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > > >>> levani.codes@gmail.com>
> > > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> > levani.codes@gmail.com>
> > > >>> <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing
> > list :)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
> > > >>> opinions/concerns
> > > >>>>>>>>>>>>> about
> > > >>>>>>>>>>>>>>> this KIP.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
> > > >>> Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
> > > >>> Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
> > wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward,
> > I’ve
> > > >>> updated
> > > >>>>>>>>>>>>>>> confluence
> > > >>>>>>>>>>>>>>>>>>>>>>>> page
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected
> > Alternatives”
> > > >>> section.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP
> > :)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
> > > >>> Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
> > wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
> > > >>> dedicated `Topic`
> > > >>>>>>>>>>>>>>> class for
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> topic
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators
> > like
> > > >>>>>>>>>>>>> `groupedBy`.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others
> > opinion
> > > >>> about this
> > > >>>>>>>>>>>>> as
> > > >>>>>>>>>>>>>>> well.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM,
> > Matthias
> > > >>> J. Sax <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>>
> > wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP!
> > And
> > > >>> thanks for
> > > >>>>>>>>>>>>>>> summarizing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
> > > >>> discussed
> > > >>>>>>>>>>>>>>> already (can't
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> really
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a
> > good
> > > >>> summary to
> > > >>>>>>>>>>>>>>> refresh the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes
> > sense.
> > > >>> With regard
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> distinction
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage
> > topics
> > > >>> and
> > > >>>>>>>>>>>>> operators
> > > >>>>>>>>>>>>>>> that use
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument,
> > it
> > > >>> might
> > > >>>>>>>>>>>>> indicate
> > > >>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> leaving
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator
> > that
> > > >>> uses
> > > >>>>>>>>>>>>>>> use-defined
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
> > > >>> operator (an
> > > >>>>>>>>>>>>>>> operator that
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> manages
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
> > > >>> Otherwise, there is
> > > >>>>>>>>>>>>>>> one
> > > >>>>>>>>>>>>>>>>>>>>>>>> operator
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
> > > >>> topics but
> > > >>>>>>>>>>>>>>> sometimes
> > > >>>>>>>>>>>>>>>>>>>>>>>> not; a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make
> > the
> > > >>> distinction
> > > >>>>>>>>>>>>>>> clearer.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
> > > >>> `Grouped`. I am
> > > >>>>>>>>>>>>>>> wondering
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> if the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does
> > apply
> > > >>> and adding
> > > >>>>>>>>>>>>>>> it is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> semantically
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
> > > >>> opinions of
> > > >>>>>>>>>>>>>>> others on
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I
> > prefer
> > > >>> atm.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration
> > objects
> > > >>> that allow
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>> configure
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer,
> > producer,
> > > >>> store. If we
> > > >>>>>>>>>>>>>>> assume that
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> a topic
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if
> > we
> > > >>> should have a
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()`
> > class
> > > >>> and method
> > > >>>>>>>>>>>>>>> instead of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> a plain
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to
> > add
> > > >>> other
> > > >>>>>>>>>>>>> configs,
> > > >>>>>>>>>>>>>>> like
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> replication
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc,
> > easily,
> > > >>> without the
> > > >>>>>>>>>>>>>>> need to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> change the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not
> > sure
> > > >>> if I like
> > > >>>>>>>>>>>>>>> them
> > > >>>>>>>>>>>>>>>>>>>>>>>> myself.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> :)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani
> > Kokhreidze
> > > >>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
> > > >>> maybe
> > > >>>>>>>>>>>>> enhancing
> > > >>>>>>>>>>>>>>> Produced
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> with num
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
> > > >>> best approach.
> > > >>>>>>>>>>>>>>> Let me
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> explain
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class
> > with
> > > >>> this
> > > >>>>>>>>>>>>>>> configuration,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> this will
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation.
> > Since
> > > >>> KStream#to is
> > > >>>>>>>>>>>>>>> the final
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> sink of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be
> > reasonable
> > > >>>>>>>>>>>>> assumption
> > > >>>>>>>>>>>>>>> that user
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> needs
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in
> > advance. And
> > > >>> in that
> > > >>>>>>>>>>>>>>> case, having
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> num of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make
> > much
> > > >>> sense.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class,
> > based
> > > >>> on API
> > > >>>>>>>>>>>>>>> contract, seems
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> like
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something
> > that
> > > >>> is
> > > >>>>>>>>>>>>>>> explicitly for
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> producer
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer,
> > partitioner
> > > >>> those all
> > > >>>>>>>>>>>>>>> are producer
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions
> > is
> > > >>> topic level
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer
> > level
> > > >>>>>>>>>>>>>>> configurations
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> together in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
> > > >>> seems like
> > > >>>>>>>>>>>>>>> Produced
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with
> > non-internal
> > > >>> (e.g
> > > >>>>>>>>>>>>> topics
> > > >>>>>>>>>>>>>>> created
> > > >>>>>>>>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics
> > and I
> > > >>> think we
> > > >>>>>>>>>>>>>>> should leave
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> it as
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into
> > account,
> > > >>> I think we
> > > >>>>>>>>>>>>>>> should
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
> > > >>> Streams should
> > > >>>>>>>>>>>>>>> create and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> manage
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
> > > >>> topics that
> > > >>>>>>>>>>>>>>> should be
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> created in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
> > > >>> numPartitions
> > > >>>>>>>>>>>>>>> configuration in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and
> > producer
> > > >>> level
> > > >>>>>>>>>>>>>>> configuration in
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> one
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
> > > >>> semantics.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name
> > optional
> > > >>> in
> > > >>>>>>>>>>>>>>> KStream#through - I
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> think
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and
> > giving
> > > >>> users
> > > >>>>>>>>>>>>>>> possibility to
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> specify
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more
> > useful :)
> > > >>>>>>>>>>>>> Considering
> > > >>>>>>>>>>>>>>> arguments
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> against
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
> > > >>> class, I see two
> > > >>>>>>>>>>>>>>> options
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> here:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> > > >>> auto-generated and
> > > >>>>>>>>>>>>>>> num of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> > numOfPartitions)
> > > >>> - topic
> > > >>>>>>>>>>>>> will
> > > >>>>>>>>>>>>>>> be auto
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
> > partitions
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> > numOfPartitions,
> > > >>> final
> > > >>>>>>>>>>>>>>> Produced<K, V>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
> > generated
> > > >>> with
> > > >>>>>>>>>>>>>>> specified num of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> > > >>> parameter.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it
> > is and
> > > >>> introduce
> > > >>>>>>>>>>>>>>> new method
> > > >>>>>>>>>>>>>>>>>>>>>>>> -
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
> > > >>> suggested this
> > > >>>>>>>>>>>>>>> in one of
> > > >>>>>>>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
> > > >>> propose the
> > > >>>>>>>>>>>>>>> following
> > > >>>>>>>>>>>>>>>>>>>>>>>> plan:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> > > >>> configuration to
> > > >>>>>>>>>>>>> Grouped
> > > >>>>>>>>>>>>>>> class (as
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method
> > overloads to
> > > >>>>>>>>>>>>>>> KStream#through
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> > > >>> auto-generated and
> > > >>>>>>>>>>>>>>> num of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> > numOfPartitions)
> > > >>> - topic
> > > >>>>>>>>>>>>> will
> > > >>>>>>>>>>>>>>> be auto
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
> > partitions
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> > numOfPartitions,
> > > >>> final
> > > >>>>>>>>>>>>>>> Produced<K, V>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
> > generated
> > > >>> with
> > > >>>>>>>>>>>>>>> specified num of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> > > >>> parameter.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> > > >>> configuration to
> > > >>>>>>>>>>>>> Grouped
> > > >>>>>>>>>>>>>>> class (as
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
> > > >>> KStream#repartition for
> > > >>>>>>>>>>>>>>> creating and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> managing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
> > > >>> already
> > > >>>>>>>>>>>>>>> discussed in the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> mailing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the
> > threads
> > > >>> that were
> > > >>>>>>>>>>>>>>> about this
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM,
> > Levani
> > > >>> Kokhreidze <
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > > >>>>>>>>>>>>> levani.codes@gmail.com>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect
> > discussion
> > > >>> around
> > > >>>>>>>>>>>>>>> KIP-221. Going
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> through
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems
> > to
> > > >>> agreement
> > > >>>>>>>>>>>>>>> around
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation,
> > as far
> > > >>> as I
> > > >>>>>>>>>>>>>>> understood, the
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> most
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
> > > >>> following:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
> > > >>> KStream#through
> > > >>>>>>>>>>>>>>> method
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
> > > >>> numOfPartitions
> > > >>>>>>>>>>>>>>>>>>>>>>>> configuration
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
> > > >>> users to
> > > >>>>>>>>>>>>> control
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing
> > stateful
> > > >>>>>>>>>>>>> operations.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
> > > >>> changes around
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > >
> >
> >

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Sophie Blee-Goldman <so...@confluent.io>.
While I'm concerned that "not augmenting groupBy as part of this KIP"
really translates to "will not get around to augmenting groupBy for a long
time if not as part of this KIP", like I said I don't want to hold up the
new
.repartition operator that it seems we do, at least, all agree on. It's a
fair
point that we can always add this in later, but undoing it is far more
problematic.

Anyways, I would be happy if we at least make a ticket to consider adding a
"number of partitions" option/suggestion to groupBy, so that we don't lose
all the thought put in to this decision so far and can avoid rehashing the
same
argument word for word and have something to point to when someone
asks "why didn't we add this numPartitions option to groupBy".

Beyond that, if the community isn't pushing for it at this moment then it
seems very
reasonable to shelve the idea for now so that the rest of this KIP can
proceed.
Without input one way or another it's hard to say what the right thing to
do is,
which makes the right thing to do "wait to add this feature"

Thanks for the good discussion everyone,

Sophie

On Fri, Nov 15, 2019 at 12:41 PM John Roesler <jo...@confluent.io> wrote:

> Hi all,
>
> I think that Sophie is asking a good question, and I do think that
> such "blanket configurations" are plausible. For example, we currently
> support (and I would encourage) "I don't know if this is going to
> create a repartition topic, but if it does, then use this name instead
> of generating one".
>
> I'm not sure I'm convinced that specifying max parallelism falls into
> this category. After all, the groupByKey+aggregate will be executed
> with _some_ max parallelism. It's either the same as the inputs'
> partition count or overridden with the proposed config. It seems
> counterintuitive to override the specified option with the default
> value.
>
> I'm not sure if I can put my finger on it, but "maybe use this name"
> seems way more reasonable to me than "maybe execute with this degree
> of parallelism".
>
> I do think (and I appreciate that this is where Sophie's example is
> coming from) that Streams should strive to be absolutely as simple and
> intuitive as possible (while still maintaining correctness). Optimal
> performance can be at odds with API simplicity. For example, the
> simplest behavior is, if you ask for 5 partitions, you get 5
> partitions. Maybe a repartition is technically not necessary (if you
> didn't change the key), but at least there's no mystery to this
> behavior.
>
> Clearly, an (opposing) tenent of simplicity is trying to prevent
> people from making mistakes, which I think is what the example boils
> down to. Sometimes, we can prevent clear mistakes, like equi-joining
> two topics with different partition counts. But for this case, it
> doesn't seem as clear-cut to be able to assume that they _said_ 5
> partitions, but they didn't really _want_ 5 partitions. Maybe we can
> just try to be clear in the documentation, and also even log a warning
> when we parse the topology, "hey, I've been asked to repartition this
> stream, but it's not necessary".
>
> If anything, this discussion really supports to me the value in just
> sticking with `repartition()` for now, and deferring
> `groupBy[Key](partitions)` to the future.
>
> > Users should not have to choose between allowing Streams to optimize the
> repartition placement, and allowing to specify a number of partitions.
>
> This is a very fair point, and it may be something that we rapidly
> return to, but it seems safe for now to introduce the non-optimizable
> `reparition()` only, and then consider optimization options later.
> Skipping available optimizations will never break correctness, but
> adding optimizations can, so it makes sense to treat them with
> caution.
>
> In conclusion, I do think that a use _could_ want to "maybe specify"
> the partition count, but I also think we can afford to pass on
> supporting this right now.
>
> I'm open to continuing the discussion, but just to avoid ambiguity, I
> still feel we should _not_ change the groupBy[Key] operation at all,
> and we should only add `repartition()` as a non-optimizable operation.
>
> Thanks all,
> -John
>
> On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
> <le...@gmail.com> wrote:
> >
> > Hello,
> >
> > Just fyi, PR was updated and now it incorporates the latest suggestions
> about joins.
> > `CopartitionedTopicsEnforcer` will throw an exception if number of
> partitions aren’t the same when using `repartition` operation along with
> `join`.
> >
> > For more details please take a look at the PR:
> https://github.com/apache/kafka/pull/7170/files <
> https://github.com/apache/kafka/pull/7170/files>
> >
> > Regards,
> > Levani
> >
> >
> > > On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io>
> wrote:
> > >
> > > Thanks a lot for the input Sophie.
> > >
> > > Your example is quite useful, and I would use it to support my claim
> > > that a "partition hint" for `Grouped` seems "useless" and does not
> > > improve the user experience.
> > >
> > > 1) You argue that a new user would be worries about repartitions topics
> > > with too many paritions. This would imply that a user is already
> > > advanced enough to understand the implication of repartitioning -- for
> > > this case, I would argue that a user also understand _when_ a
> > > auto-repartitioning would happen and thus the users understands where
> to
> > > insert a `repartition()` operation.
> > >
> > > 2) For specifying Serdes: if a `groupByKey()` does not trigger
> > > auto-repartitioning it's not required to specify the serdes and if they
> > > are specified they would be ignored/unused (note, that `groupBy()`
> would
> > > always trigger a repartitioning). Of course, if the default Serdes from
> > > the config match (eg, all data types are Json anyway), a user does not
> > > need to worry about specifying serdes. -- For new user that play
> around,
> > > I would assume that they work a lot with primitive types and thus would
> > > need to specify the serdes -- hence, they would learn about
> > > auto-repartitioning the hard way anyhow, because each time a
> > > `groupByKey()` does trigger auto-repartioning, they would need to pass
> > > in the correct Serdes -- this way, they would also be educated where to
> > > insert a `repartition()` operator if needed.
> > >
> > > 3) If a new user really just "plays around", I don't think they use an
> > > input topic with 100 partitions but most likely have a local single
> node
> > > broker with most likely single partitions topics.
> > >
> > >
> > > My main argument for my current proposal is however, that---based on
> > > past experience---it's better to roll out a new feature more carefully
> > > and see how it goes. Last, as John pointed out, we can still extend the
> > > feature in the future. Instead of making a judgment call up-front,
> being
> > > more conservative and less fancy, and revisit the design based on
> > > actuall user feedback after the first version is rolled out, seems to
> be
> > > the better option. Undoing a feature is must harder than extending it.
> > >
> > >
> > > While I advocate strong for a simple first version of this feature,
> it's
> > > a community decission in the end, and I would not block this KIP if
> > > there is a broad preference to add `Grouped#withNumberOfPartitions()`
> > > either.
> > >
> > >
> > > -Matthias
> > >
> > > On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
> > >> It seems like we all agree at this point (please correct me if
> wrong!) that
> > >> we should NOT change
> > >> the existing repartitioning behavior, ie we should allow Streams to
> > >> continue to determine when and
> > >> where to repartition -- *unless* explicitly informed to by the use of
> a
> > >> .through or the new .repartition operator.
> > >>
> > >> Regarding groupBy, the existing behavior we should not disrupt is
> > >> a) repartition *only* when required due to upstream key-changing
> operation
> > >> (ie don't force repartitioning
> > >> based on the presence of an optional config parameter), and
> > >> b) allow optimization of required repartitions, if any
> > >>
> > >> Within the constraint of not breaking the existing behavior, this
> still
> > >> leaves open the question of whether we
> > >> want to improve the user experience by allowing to provide groupBy
> with a
> > >> *suggestion* for numPartitions (or to
> > >> put it more fairly, whether that *will* improve the experience). I
> agree
> > >> with many of the arguments outlined above but
> > >> let me just push back on this one issue one final time, and if we
> can't
> > >> come to a consensus then I am happy to drop
> > >> it for now so that the KIP can proceed.
> > >>
> > >> Specifically, my proposal would be to simply augment Grouped with an
> > >> optional numPartitions, understood to
> > >> indicate the user's desired number of partitions *if Streams decides
> to
> > >> repartition due to that groupBy*
> > >>
> > >>> if a user cares about the number of partition, the user wants to
> enforce
> > >> a repartitioning
> > >> First, I think we should take a step back and examine this claim. I
> agree
> > >> 100% that *if this is true,*
> > >> *then we should not give groupBy an optional numPartitions.* As far
> as I
> > >> see it, there's no argument
> > >> to be had there if we *presuppose that claim.* But I'm not convinced
> in
> > >> that as an axiom of the user
> > >> experience and think we should be examining that claim itself, not the
> > >> consequences of it.
> > >>
> > >> To give a simple example, let's say some new user is trying out
> Streams and
> > >> wants to just play around
> > >> with it to see if it might be worth looking into. They want to just
> write
> > >> up a simple app and test it out on the
> > >> data in some existing topics they have with a large number of
> partitions,
> > >> and a lot of data. They're just messing
> > >> around, trying new topologies and don't want to go through each new
> one
> > >> step by step to determine if (or where)
> > >> a repartition might be required. They also don't want to force a
> > >> repartition if it turns out to not be required, so they'd
> > >> like to avoid the nice new .repartition operator they saw. But given
> the
> > >> huge number of input partitions, they'd like
> > >> to rest assured that if a repartition does end up being required
> somewhere
> > >> during dev, it will not be created with
> > >> the same huge number of partitions that their input topic has -- so
> they
> > >> just pass groupBy a small numPartitions
> > >> suggestion.
> > >>
> > >> I know that's a bit of a contrived example but I think it does
> highlight
> > >> how and when this might be a considerable
> > >> quality of life improvement, in particular for new users to Streams
> and/or
> > >> during the dev cycle. *You don't want to*
> > >> *force a repartition if it wasn't necessary, but you don't want to
> create a
> > >> topic with a huge partition count either.*
> > >>
> > >> Also, while the optimization discussion took us down an interesting
> but
> > >> ultimately more distracting road, it's worth
> > >> pointing out that it is clearly a major win to have as few
> > >> repartition topics/steps as possible. Given that we
> > >> don't want to change existing behavior, the optimization framework
> can only
> > >> help out when the placement of
> > >> repartition steps is flexible, which means only those from .groupBy
> (and
> > >> not .repartition). *Users should not*
> > >> *have to choose between allowing Streams to optimize the repartition
> > >> placement, and allowing to specify a *
> > >> *number of partitions.*
> > >>
> > >> Lastly, I have what may be a stupid question but for my own
> edification of
> > >> how groupBy works:
> > >> if you do a .groupBy and a repartition is NOT required, does it ever
> need
> > >> to serialize/deserialize
> > >> any of the data? In other words, if you pass a key/value serde to
> groupBy
> > >> and it doesn't trigger
> > >> a repartition, is the serde(s) just ignored and thus more like a
> suggestion
> > >> than a requirement?
> > >>
> > >> So again, I don't want to hold up this KIP forever but I feel we've
> spent
> > >> some time getting slightly
> > >> off track (although certainly into very interesting discussions) yet
> never
> > >> really addressed or questioned
> > >> the basic premise: *could a user want to specify a number of
> partitions but
> > >> not enforce a repartition (at that*
> > >> *specific point in the topology)?*
> > >>
> > >>
> > >>
> > >> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <
> matthias@confluent.io>
> > >> wrote:
> > >>
> > >>> Side remark:
> > >>>
> > >>> If the user specifies `repartition()` on both side of the join, we
> can
> > >>> actually throw the execption earlier, ie, when we build the topology.
> > >>>
> > >>> Current, we can do this check only after Kafka Streams was started,
> > >>> within `StreamPartitionAssignor#assign()` -- we still need to keep
> this
> > >>> check for the case that none or only one side has a user specified
> > >>> number of partitions though.
> > >>>
> > >>>
> > >>> -Matthias
> > >>>
> > >>> On 11/14/19 8:15 AM, John Roesler wrote:
> > >>>> Thanks, all,
> > >>>>
> > >>>> I can get behind just totally leaving out reparation-via-groupBy. If
> > >>>> we only introduce `repartition()` for now, we're making the minimal
> > >>>> change to gain the desired capability.
> > >>>>
> > >>>> Plus, since we agree that `repartition()` should never be
> optimizable,
> > >>>> it's a future-compatible proposal. I.e., if we were to add a
> > >>>> non-optimizable groupBy(partitions) operation now, and want to make
> it
> > >>>> optimizable in the future, we have to worry about topology
> > >>>> compatibility. Better to just do non-optimizable `repartition()`
> now,
> > >>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
> > >>>>
> > >>>> About joins, yes, it's a concern, and IMO we should just do the same
> > >>>> thing we do now... check at runtime that the partition counts on
> both
> > >>>> sides match and throw an exception otherwise. What this means as a
> > >>>> user is that if you explicitly repartition the left side to 100
> > >>>> partitions, and then join with the right side at 10 partitions, you
> > >>>> get an exception, since this operation is not possible. You'd either
> > >>>> have to "step down" the left side again, back to 10 partitions, or
> you
> > >>>> could repartition the right side to 100 partitions before the join.
> > >>>> The choice has to be the user's, since it depends on their desired
> > >>>> execution parallelism.
> > >>>>
> > >>>> Thanks,
> > >>>> -John
> > >>>>
> > >>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <
> matthias@confluent.io>
> > >>> wrote:
> > >>>>>
> > >>>>> Thanks a lot John. I think the way you decompose the operators is
> super
> > >>>>> helpful for this discussion.
> > >>>>>
> > >>>>> What you suggest with regard to using `Grouped` and enforcing
> > >>>>> repartitioning if the number of partitions is specified is
> certainly
> > >>>>> possible. However, I am not sure if we _should_ do this. My
> reasoning is
> > >>>>> that an enforce repartitioning as introduced via `repartition()`
> is an
> > >>>>> expensive operations, and it seems better to demand an more
> explicit
> > >>>>> user opt-in to trigger it. Just setting an optional parameter
> might be
> > >>>>> too subtle to trigger such a heavy "side effect".
> > >>>>>
> > >>>>> While I agree about "usability" in general, I would prefer a more
> > >>>>> conservative appraoch to introduce this feature, see how it goes,
> and
> > >>>>> maybe make it more advance later on. This also applies to what
> > >>>>> optimzation we may or may not allow (or are able to perform at
> all).
> > >>>>>
> > >>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
> > >>>>> Grouped`, I agree that it might not be a good idea.
> > >>>>>
> > >>>>> Atm, I see an enforces repartitioning as non-optimizable and as a
> good
> > >>>>> first step and I would suggest to not intoruce anything else for
> now.
> > >>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
> > >>>>> Grouped)` is something we could add later.
> > >>>>>
> > >>>>>
> > >>>>> Therefore, I would not change `Grouped` but only introduce
> > >>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in
> to
> > >>>>> set the number of partitions, would need to rewrite their code to
> > >>>>> `selectKey(...).repartition(...).groupByKey()`. It's less
> convinient but
> > >>>>> also less risky from an API and optimization point of view.
> > >>>>>
> > >>>>>
> > >>>>> @Levani: about joins -> yes, we will need to check the specified
> number
> > >>>>> of partitions (if any) and if they don't match, throw an
> exception. We
> > >>>>> can discuss this on the PR -- I am just trying to get the PR for
> KIP-466
> > >>>>> merged -- your is next on the list :)
> > >>>>>
> > >>>>>
> > >>>>> Thoughts?
> > >>>>>
> > >>>>>
> > >>>>> -Matthias
> > >>>>>
> > >>>>>
> > >>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
> > >>>>>> Thank you all for an interesting discussion. This is very
> enlightening.
> > >>>>>>
> > >>>>>> Thank you Matthias for your explanation. Your arguments are very
> true.
> > >>> It makes sense that if user specifies number of partitions he/she
> really
> > >>> cares that those specifications are applied to internal topics.
> > >>>>>> Unfortunately, in current implementation this is not true during
> > >>> `join` operation. As I’ve written in the PR comment, currently, when
> > >>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max
> number of
> > >>> partitions from the two source topics.
> > >>>>>> I’m not really sure what would be the other way around this
> situation.
> > >>> Maybe fail the stream altogether and inform the user to specify same
> number
> > >>> of partitions?
> > >>>>>> Or we should treat join operations in a same way as it is right
> now
> > >>> and basically choose max number of partitions even when `repartition`
> > >>> operation is specified, because Kafka Streams “knows the best” how to
> > >>> handle joins?
> > >>>>>> You can check integration tests how it’s being handled currently.
> Open
> > >>> to suggestions on that part.
> > >>>>>>
> > >>>>>> As for groupBy, I agree and John raised very interesting points.
> My
> > >>> arguments for allowing users to specify number of partitions during
> groupBy
> > >>> operations mainly was coming from the usability perspective.
> > >>>>>> So building on top of what John said, maybe it makes sense to make
> > >>> `groupBy` operations smarter and whenever user specifies
> > >>> `numberOfPartitions` configuration, repartitioning will be enforced,
> wdyt?
> > >>>>>> I’m not going into optimization part yet :) I think it will be
> part of
> > >>> separate PR and task, but overall it makes sense to apply
> optimizations
> > >>> where number of partitions are the same.
> > >>>>>>
> > >>>>>> As for Repartitioned extending Grouped, I kinda feel that it
> won’t fit
> > >>> nicely in current API design.
> > >>>>>> In addition, in the PR review, John mentioned that there were a
> lot of
> > >>> troubles in the past trying to use one operation's configuration
> objects on
> > >>> other operations.
> > >>>>>> Also it makes sense to keep them separate in terms of
> compatibility.
> > >>>>>> In that case, we don’t have to worry every time Grouped is
> changed,
> > >>> what would be the implications on `repartition` operations.
> > >>>>>>
> > >>>>>> Kind regards,
> > >>>>>> Levani
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io>
> wrote:
> > >>>>>>>
> > >>>>>>> Ah, thanks for the clarification. I missed your point.
> > >>>>>>>
> > >>>>>>> I like the framework you've presented. It does seem simpler to
> assume
> > >>>>>>> that they either care about the partition count and want to
> > >>>>>>> repartition to realize it, or they don't care about the number.
> > >>>>>>> Returning to this discussion, it does seem unlikely that they
> care
> > >>>>>>> about the number and _don't_ care if it actually gets realized.
> > >>>>>>>
> > >>>>>>> But then, it still seems like we can just keep the option as
> part of
> > >>>>>>> Grouped. As in:
> > >>>>>>>
> > >>>>>>> // user does not care
> > >>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
> > >>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
> > >>>>>>>
> > >>>>>>> // user does care
> > >>>>>>> stream.repartition(Repartitioned)
> > >>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
> > >>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
> > >>>>>>>
> > >>>>>>> ----
> > >>>>>>>
> > >>>>>>> The above discussion got me thinking about algebra. Matthias is
> > >>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent
> to
> > >>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not
> convinced that
> > >>>>>>> we should force people to apply that expansion themselves vs.
> having a
> > >>>>>>> more compact way to express it if they don't care where exactly
> the
> > >>>>>>> repartition occurs. However, thinking about these operators
> > >>>>>>> algebraically can really help *us* narrow down the number of
> different
> > >>>>>>> expressions we have to consider.
> > >>>>>>>
> > >>>>>>> Let's consider some identities:
> > >>>>>>>
> > >>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
> > >>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
> > >>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
> > >>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
> > >>>>>>> D: repartition = sink(managed) + src
> > >>>>>>>
> > >>>>>>> In these identities, I used one special identifier (...), which
> means
> > >>>>>>> any number (0+) of operations that are not src, mapKey,
> groupBy[Key],
> > >>>>>>> repartition, or agg.
> > >>>>>>>
> > >>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
> > >>>>>>> operations are not executable. In other words, you have to get
> to a
> > >>>>>>> point where you can apply B to convert a groupByKey into a
> passthough
> > >>>>>>> in order to execute the program. This is just a formal way of
> stating
> > >>>>>>> what already happens in Kafka Streams.
> > >>>>>>>
> > >>>>>>> By applying A, we can just completely leave `groupBy` out of our
> > >>>>>>> analysis. It trivially decomposes into a mapKey followed by a
> > >>>>>>> groupByKey.
> > >>>>>>>
> > >>>>>>> Then, we can eliminate the "repartition required" case of
> `groupByKey`
> > >>>>>>> by applying C followed by D to get to the "no repartition
> required"
> > >>>>>>> version of groupByKey, which in turn sets us up to apply B to
> get an
> > >>>>>>> executable topology.
> > >>>>>>>
> > >>>>>>> Fundamentally, you can think about KIP-221 is as proposing a
> modified
> > >>>>>>> D identity in which you can specify the partition count of the
> managed
> > >>>>>>> sink topic:
> > >>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
> > >>>>>>>
> > >>>>>>> Since users _could_ apply the identities above, we don't
> actually have
> > >>>>>>> to add any partition count to groupBy[Key], but we decided early
> on in
> > >>>>>>> the KIP discussion that it's more ergonomic to add it. In that
> case,
> > >>>>>>> we also have to modify A and C:
> > >>>>>>> A': groupBy(mapper, pc) + agg
> > >>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
> > >>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
> > >>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
> > >>>>>>>
> > >>>>>>> Which sets us up still to always be able to get back to a plain
> > >>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
> > >>>>>>> ultimately B to get an executable topology.
> > >>>>>>>
> > >>>>>>> What about the optimizer?
> > >>>>>>> The optimizer applies another set of graph-algebraic identities
> to
> > >>>>>>> minimize the number of repartition topics in a topology.
> > >>>>>>>
> > >>>>>>> (forgive my ascii art)
> > >>>>>>>
> > >>>>>>> E: (merging repartition nodes)
> > >>>>>>> (...) -> repartition -> X
> > >>>>>>> \-> repartition -> Y
> > >>>>>>> =
> > >>>>>>> (... + repartition) -> X
> > >>>>>>>    \-> Y
> > >>>>>>> F: (reordering around repartition)
> > >>>>>>> Where SVO is any non-key-changing, stateless, operation:
> > >>>>>>> repartition -> SVO = SVO -> repartition
> > >>>>>>>
> > >>>>>>> In terms of these identities, what the optimizer does is apply F
> > >>>>>>> repeatedly in either direction to a topology to factor out
> common in
> > >>>>>>> branches so that it can apply E to merge repartition nodes. This
> was
> > >>>>>>> especially necessary before KIP-221 because you couldn't directly
> > >>>>>>> express `repartition` in the DSL, only indirectly via
> `groupBy[Key]`,
> > >>>>>>> so there was no way to do the factoring manually.
> > >>>>>>>
> > >>>>>>> We can now state very clearly that in KIP-221, explicit
> > >>>>>>> `repartition()` operators should create a "reordering barrier".
> So, F
> > >>>>>>> cannot be applied to an explicit `repartition()`. Also, I think
> we
> > >>>>>>> decided earlier that explicit `repartition()` operations would
> also be
> > >>>>>>> ineligible for merging, so E can't be applied to explicit
> > >>>>>>> `repartition()` operations either. I think we feel we _could_
> apply E
> > >>>>>>> without harm, but we want to be conservative for now.
> > >>>>>>>
> > >>>>>>> I think the salient point from the latter discussion has been
> that
> > >>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_
> constitute
> > >>>>>>> an explicit `repartition()` operator, and therefore, the
> resulting
> > >>>>>>> repartition node remains eligible for optimization.
> > >>>>>>>
> > >>>>>>> To be clear, I agree with Matthias that the provided partition
> count
> > >>>>>>> _must_ be used in the resulting implicit repartition. This has
> some
> > >>>>>>> implications for E. Namely, E could only be applied to two
> repartition
> > >>>>>>> nodes that have the same partition count. This has always been
> > >>>>>>> trivially true before KIP-221 because the partition count has
> always
> > >>>>>>> been "unspecified", i.e., it would be determined at runtime by
> the
> > >>>>>>> user-managed-topics' partition counts. Now, it could be
> specified or
> > >>>>>>> unspecified. We can simply augment E to allow merging only
> repartition
> > >>>>>>> nodes where the partition count is EITHER "specified and the
> same on
> > >>>>>>> both sides", OR "unspecified on both sides".
> > >>>>>>>
> > >>>>>>> Sorry for the long email, but I have a hope that it builds a
> solid
> > >>>>>>> theoretical foundation for our decisions in KIP-221, so we can
> have
> > >>>>>>> confidence that there are no edge cases for design flaws to hide.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> -John
> > >>>>>>>
> > >>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
> > >>> matthias@confluent.io> wrote:
> > >>>>>>>>
> > >>>>>>>>> it seems like we do want to allow
> > >>>>>>>>>> people to optionally specify a partition count as part of this
> > >>>>>>>>>> operation, but we don't want that option to _force_
> repartitioning
> > >>>>>>>>
> > >>>>>>>> Correct, ie, that is my suggestions.
> > >>>>>>>>
> > >>>>>>>>> "Use P partitions if repartitioning is necessary"
> > >>>>>>>>
> > >>>>>>>> I disagree here, because my reasoning is that:
> > >>>>>>>>
> > >>>>>>>> - if a user cares about the number of partition, the user wants
> to
> > >>>>>>>> enforce a repartitioning
> > >>>>>>>> - if a user does not case about the number of partitions, we
> don't
> > >>> need
> > >>>>>>>> to provide them a way to pass in a "hint"
> > >>>>>>>>
> > >>>>>>>> Hence, it should be sufficient to support:
> > >>>>>>>>
> > >>>>>>>> // user does not care
> > >>>>>>>>
> > >>>>>>>> `stream.groupByKey(Grouped)`
> > >>>>>>>> `stream.grouBy(..., Grouped)`
> > >>>>>>>>
> > >>>>>>>> // user does care
> > >>>>>>>>
> > >>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
> > >>>>>>>> `streams.groupBy(..., Repartitioned)`
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> -Matthias
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
> > >>>>>>>>> Thanks for those thoughts, Matthias,
> > >>>>>>>>>
> > >>>>>>>>> I find your reasoning about the optimization behavior
> compelling.
> > >>> The
> > >>>>>>>>> `through` operation is very simple and clear to reason about.
> It
> > >>> just
> > >>>>>>>>> passes the data exactly at the specified point in the topology
> > >>> exactly
> > >>>>>>>>> through the specified topic. Likewise, if a user invokes a
> > >>>>>>>>> `repartition` operator, the simplest behavior is if we just do
> what
> > >>>>>>>>> they asked for.
> > >>>>>>>>>
> > >>>>>>>>> Stepping back to think about when optimizations are surprising
> and
> > >>>>>>>>> when they aren't, it occurs to me that we should be free to
> move
> > >>>>>>>>> around repartitions when users have asked to perform some
> operation
> > >>>>>>>>> that implies a repartition, like "change keys, then filter,
> then
> > >>>>>>>>> aggregate". This program requires a repartition, but it could
> be
> > >>>>>>>>> anywhere between the key change and the aggregation. On the
> other
> > >>>>>>>>> hand, if they say, "change keys, then filter, then
> repartition, then
> > >>>>>>>>> aggregate", it seems like they were pretty clear about their
> desire,
> > >>>>>>>>> and we should just take it at face value.
> > >>>>>>>>>
> > >>>>>>>>> So, I'm sold on just literally doing a repartition every time
> they
> > >>>>>>>>> invoke the `repartition` operator.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is
> more
> > >>> nuanced.
> > >>>>>>>>>
> > >>>>>>>>> What you said about `groupByKey` makes sense. If they key
> hasn't
> > >>>>>>>>> actually changed, then we don't need to repartition before
> > >>>>>>>>> aggregating. On the other hand, `groupBy` is specifically
> changing
> > >>> the
> > >>>>>>>>> key as part of the grouping operation, so (as you said) we
> > >>> definitely
> > >>>>>>>>> have to do a repartition.
> > >>>>>>>>>
> > >>>>>>>>> If I'm reading the discussion right, it seems like we do want
> to
> > >>> allow
> > >>>>>>>>> people to optionally specify a partition count as part of this
> > >>>>>>>>> operation, but we don't want that option to _force_
> repartitioning
> > >>> if
> > >>>>>>>>> it's not needed. That last clause is the key. "Use P
> partitions if
> > >>>>>>>>> repartitioning is necessary" is a directive that applies
> cleanly and
> > >>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call
> the
> > >>>>>>>>> option `numberOfPartitionsHint`, which along with the "if
> necessary"
> > >>>>>>>>> javadoc, should make it clear that the option won't force a
> > >>>>>>>>> repartition, and also gives us enough latitude to still employ
> the
> > >>>>>>>>> optimizer on those repartition topics?
> > >>>>>>>>>
> > >>>>>>>>> If we like the idea of expressing it as a "hint" for grouping
> and a
> > >>>>>>>>> "command" for `repartition`, then it seems like it still makes
> sense
> > >>>>>>>>> to keep Grouped and Repartitioned separate, as they would
> actually
> > >>>>>>>>> offer different methods with distinct semantics.
> > >>>>>>>>>
> > >>>>>>>>> WDYT?
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> -John
> > >>>>>>>>>
> > >>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
> > >>> matthias@confluent.io> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Sorry for late reply.
> > >>>>>>>>>>
> > >>>>>>>>>> I guess, the question boils down to the intended semantics of
> > >>>>>>>>>> `repartition()`. My understanding is as follows:
> > >>>>>>>>>>
> > >>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using
> the
> > >>>>>>>>>> upstream topic to determine the number of partitions)
> > >>>>>>>>>> - KS does auto-repartitioning only for downstream DSL
> operators
> > >>> like
> > >>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
> > >>> auto-repartitioning
> > >>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
> > >>>>>>>>>> - KS offers `through()` to enforce a repartitioning --
> however,
> > >>> the user
> > >>>>>>>>>> needs to create the topic manually (with the desired number of
> > >>> partitions).
> > >>>>>>>>>>
> > >>>>>>>>>> I see two main applications for `repartitioning()`:
> > >>>>>>>>>>
> > >>>>>>>>>> 1) repartition data before a `transform()` but user does not
> want
> > >>> to
> > >>>>>>>>>> manage the topic
> > >>>>>>>>>> 2) scale out a downstream subtopology
> > >>>>>>>>>>
> > >>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a
> users
> > >>> calls
> > >>>>>>>>>> it, a repartitining is enforced, with the difference that KS
> > >>> manages the
> > >>>>>>>>>> topic and the user does not need to create it.
> > >>>>>>>>>>
> > >>>>>>>>>> This behavior makes (1) and (2) possible.
> > >>>>>>>>>>
> > >>>>>>>>>>> I think many users would prefer to just say "if there *is* a
> > >>> repartition
> > >>>>>>>>>>> required at this point in the topology, it should
> > >>>>>>>>>>> have N partitions"
> > >>>>>>>>>>
> > >>>>>>>>>> Because of (2), I disagree. Either a user does not care about
> > >>> scaling
> > >>>>>>>>>> out, for which case she would not specify the number of
> > >>> partitions. Or a
> > >>>>>>>>>> user does care, and hence wants to enforce the scale out. I
> don't
> > >>> think
> > >>>>>>>>>> that any user would say, "maybe scale out".
> > >>>>>>>>>>
> > >>>>>>>>>> Therefore, the optimizer should never ignore the repartition
> > >>> operation.
> > >>>>>>>>>> As a "consequence" (because repartitioning is expensive) a
> user
> > >>> should
> > >>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking
> an
> > >>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be
> "dangerous"
> > >>>>>>>>>> because it might be too subtle and an "optional scaling out"
> as
> > >>> laid out
> > >>>>>>>>>> above does not make sense IMHO.
> > >>>>>>>>>>
> > >>>>>>>>>> I am also not worried about "over repartitioning" because the
> > >>> result
> > >>>>>>>>>> stream would never trigger auto-repartitioning. Only if
> multiple
> > >>>>>>>>>> consecutive calls to `repartition()` are made it could be bad
> --
> > >>> but
> > >>>>>>>>>> that's the same with `through()`. In the end, there is always
> some
> > >>>>>>>>>> responsibility on the user.
> > >>>>>>>>>>
> > >>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be
> required,
> > >>>>>>>>>> however, for `groupByKey()` it depends if the KStream is
> marked as
> > >>>>>>>>>> `repartitioningRequired`.
> > >>>>>>>>>>
> > >>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a
> user to
> > >>> set
> > >>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different
> story,
> > >>>>>>>>>> because calling
> > >>>>>>>>>>
> > >>>>>>>>>>  `repartition().groupBy()`
> > >>>>>>>>>>
> > >>>>>>>>>> does not achieve what we want. Hence, allowing users to pass
> in the
> > >>>>>>>>>> number of users partitions into `groupBy()` does actually
> makes
> > >>> sense,
> > >>>>>>>>>> because repartitioning will happen anyway and thus we can
> > >>> piggyback a
> > >>>>>>>>>> scaling decision.
> > >>>>>>>>>>
> > >>>>>>>>>> I think that John has a fair concern about the overloads,
> however,
> > >>> I am
> > >>>>>>>>>> not convinced that using `Grouped` to specify the number of
> > >>> partitions
> > >>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned`
> and
> > >>> both
> > >>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I
> am
> > >>>>>>>>>> wondering if we could bridge the gap between both, if we
> would make
> > >>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
> > >>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what
> seems to
> > >>> make
> > >>>>>>>>>> the API quite smooth:
> > >>>>>>>>>>
> > >>>>>>>>>> `stream.groupBy(..., Grouped...)`
> > >>>>>>>>>>
> > >>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Thoughts?
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> -Matthias
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> > >>>>>>>>>>> Hi Sophie,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
> > >>> hearing others opinion as well on this.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Kind regards,
> > >>>>>>>>>>> Levani
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
> > >>> sophie@confluent.io> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> > >>> other hand
> > >>>>>>>>>>>> Kafka Streams has already
> > >>>>>>>>>>>>> optimizer in place which alters topology independently
> from user
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I agree (with you) and think this is a good way to put it
> -- we
> > >>> currently
> > >>>>>>>>>>>> auto-repartition for the user so
> > >>>>>>>>>>>> that they don't have to walk through their entire topology
> and
> > >>> reason about
> > >>>>>>>>>>>> when and where to place a
> > >>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly
> force
> > >>> this onto the
> > >>>>>>>>>>>> user? How certain are we that
> > >>>>>>>>>>>> users will always get this right? It's easy to imagine that
> > >>> during
> > >>>>>>>>>>>> development, you write your new app with
> > >>>>>>>>>>>> correctly placed repartitions in order to use this new
> feature.
> > >>> During the
> > >>>>>>>>>>>> course of development you end up
> > >>>>>>>>>>>> tweaking the topology, but don't remember to review or move
> the
> > >>>>>>>>>>>> repartitioning since you're used to Streams
> > >>>>>>>>>>>> doing this for you. If you use only single-partition topics
> for
> > >>> testing,
> > >>>>>>>>>>>> you might not even notice your app is
> > >>>>>>>>>>>> spitting out incorrect results!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
> > >>> introduce a new
> > >>>>>>>>>>>> feature and say that to use it, you can't take
> > >>>>>>>>>>>> advantage of this other feature anymore. Also, is it
> possible our
> > >>>>>>>>>>>> optimization framework could ever include an
> > >>>>>>>>>>>> optimized repartitioning strategy that is better than what a
> > >>> user could
> > >>>>>>>>>>>> achieve by manually inserting repartitions?
> > >>>>>>>>>>>> Do we expect users to have a deep understanding of the best
> way
> > >>> to
> > >>>>>>>>>>>> repartition their particular topology, or is it
> > >>>>>>>>>>>> likely they will end up over-repartitioning either due to
> missed
> > >>>>>>>>>>>> optimizations or unnecessary extra repartitions?
> > >>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
> > >>> repartition
> > >>>>>>>>>>>> required at this point in the topology, it should
> > >>>>>>>>>>>> have N partitions"
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped
> rather
> > >>> than
> > >>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in
> line
> > >>> with the
> > >>>>>>>>>>>> current syntax so +1 from me
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
> > >>> levani.codes@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hello all,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under
> review and
> > >>> it’s
> > >>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP
> to
> > >>> address
> > >>>>>>>>>>>>> couple of concerns raised by Matthias and John.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
> > >>> control over
> > >>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
> > >>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is
> done
> > >>> in
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
> > >>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which
> is
> > >>> planned to
> > >>>>>>>>>>>>> be separate PR.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> While all agree about general implementation and idea
> behind
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR,
> introducing new
> > >>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
> > >>> questions during
> > >>>>>>>>>>>>> the review.
> > >>>>>>>>>>>>> Matthias raised concern that there can be cases when user
> uses
> > >>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
> > >>> repartitioning may
> > >>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
> > >>> would never be
> > >>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted
> your
> > >>> comment).
> > >>>>>>>>>>>>> So instead, if user wants to control parallelism of
> > >>> sub-topologies, he or
> > >>>>>>>>>>>>> she should always use `KStream#repartition` operation
> before
> > >>> groupBy. Full
> > >>>>>>>>>>>>> comment can be seen here:
> > >>>>>>>>>>>>>
> > >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> > >>>>>>>>>>>>>
> > >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On the same topic, John pointed out that, from API design
> > >>> perspective, we
> > >>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
> > >>> operators between
> > >>>>>>>>>>>>> one another. So instead of introducing new
> > >>> `KStream#groupBy(Repartitioned)`
> > >>>>>>>>>>>>> for specifying number of partitions for internal topic, we
> > >>> should update
> > >>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> > >>> other hand
> > >>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
> > >>> topology
> > >>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
> > >>> Streams,
> > >>>>>>>>>>>>> internally would optimize topology in the best way
> possible,
> > >>> even if in
> > >>>>>>>>>>>>> some cases this means ignoring some operator configurations
> > >>> passed by the
> > >>>>>>>>>>>>> user. Also, I agree with John about API design semantics.
> If we
> > >>> go through
> > >>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes
> more
> > >>> sense to
> > >>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead
> of
> > >>> introducing
> > >>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I would really appreciate communities feedback on this.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Kind regards,
> > >>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
> > >>> sophie@confluent.io>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hey Levani,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
> > >>> don't have
> > >>>>>>>>>>>>> much
> > >>>>>>>>>>>>>> spare time at the
> > >>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
> > >>> things, but
> > >>>>>>>>>>>>> feel
> > >>>>>>>>>>>>>> free to pick up something else
> > >>>>>>>>>>>>>> to work on in the meantime until things have calmed down
> a bit!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>> Sophie
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hello all,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like
> to get
> > >>> some
> > >>>>>>>>>>>>>>> attention on this PR:
> > >>> https://github.com/apache/kafka/pull/7170 <
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
> > >>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> > >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
> > >>>>>>>>>>>>>>> It's been a while now and I would love to move on to
> other
> > >>> KIPs as well.
> > >>>>>>>>>>>>>>> Please let me know if you have any concerns.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
> > >>>>>>>>>>>>>>>
> > >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> > >>>>>>>>>>>>>
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
> > >>> <
> > >>>>>>>>>>>>>>>
> > >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> > >>>>>>>>>>>>>
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
> > >>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Matthias,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion
> on
> > >>> that one.
> > >>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a
> good
> > >>> idea.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Updated KIP
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
> > >>> matthias@confluent.io
> > >>>>>>>>>>>>> <ma...@confluent.io>
> > >>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
> matthias@confluent.io>>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> One question:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Why do we add
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
> > >>> numberOfPartitions)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
> > >>> `#numberOfPartitions(int)` in
> > >>>>>>>>>>>>>>>>>> combination with `withName()` and
> > >>> `withNumberOfPartitions()` should
> > >>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
> > >>> small if
> > >>>>>>>>>>>>>>> possible.)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Otherwise LGTM.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> > >>>>>>>>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
> > >>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re
> any
> > >>> other
> > >>>>>>>>>>>>>>> concerns about this KIP, please let me know.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>
> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi Matthias,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
> > >>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> ).
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> > >>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
> > >>>>>>>>>>>>>>> <mailto:matthias@confluent.io <mailto:
> matthias@confluent.io>>
> > >>> <mailto:
> > >>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
> <mailto:
> > >>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
> > >>> partitioning
> > >>>>>>>>>>>>>>> strategy.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
> > >>> producer
> > >>>>>>>>>>>>>>> configs. My
> > >>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
> > >>> "extension" to
> > >>>>>>>>>>>>>>> `Produced`,
> > >>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to
> think
> > >>> about it and
> > >>>>>>>>>>>>>>> helps
> > >>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to
> avoid
> > >>>>>>>>>>>>> abbreviations.
> > >>>>>>>>>>>>>>> Can
> > >>>>>>>>>>>>>>>>>>>>> we rename:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more
> `static`
> > >>> methods:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> > >>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> > >>>>>>>>>>>>>>>>>>>>> -
> Repartitioned.streamPartitioner(StreamPartitioner)
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> > >>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it
> makes
> > >>> sense to
> > >>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>> some duplicate configurations between operators in order
> to
> > >>> keep API
> > >>>>>>>>>>>>> simple
> > >>>>>>>>>>>>>>> and usable.
> > >>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to
> have
> > >>> proper
> > >>>>>>>>>>>>>>> backward compatibility.
> > >>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
> > >>> separate was
> > >>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
> > >>> producer level
> > >>>>>>>>>>>>>>> configs makes API more readable.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman
> <
> > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> <mailto:
> > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> <mailto:
> > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
> > >>> producer level
> > >>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
> > >>> considerations
> > >>>>>>>>>>>>>>> separate.
> > >>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
> > >>> level
> > >>>>>>>>>>>>>>> configuration. But
> > >>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just
> as
> > >>> much a topic
> > >>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have
> two
> > >>> producers
> > >>>>>>>>>>>>>>> configured
> > >>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
> > >>> they are
> > >>>>>>>>>>>>>>> writing to the
> > >>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
> > >>> part. So in a
> > >>>>>>>>>>>>>>> sense, these
> > >>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
> > >>> producers.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams
> API is
> > >>> not always
> > >>>>>>>>>>>>>>> true to
> > >>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
> > >>> operator are
> > >>>>>>>>>>>>>>> wrapped into a
> > >>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We
> could
> > >>> instead
> > >>>>>>>>>>>>>>> split out the
> > >>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
> > >>> parameter to keep
> > >>>>>>>>>>>>>>> topic and
> > >>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
> > >>> increases the
> > >>>>>>>>>>>>>>> API surface
> > >>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to
> just
> > >>> say "this
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
> > >>> without worrying
> > >>>>>>>>>>>>>>> about what
> > >>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
> > >>> field of
> > >>>>>>>>>>>>>>> Repartitioned,
> > >>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
> > >>> elsewhere in
> > >>>>>>>>>>>>>>> Streams at
> > >>>>>>>>>>>>>>>>>>>>>>> the moment
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani
> Kokhreidze <
> > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> > >>> <mailto:
> > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
> > >>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> > >>>>>>>>>>>>>>> `Repartitioned#partitioner`
> > >>>>>>>>>>>>>>>>>>>>>>>> configuration.
> > >>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class
> for
> > >>> topic level
> > >>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer
> level
> > >>>>>>>>>>>>>>> configurations (such
> > >>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in
> this
> > >>> thread).
> > >>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
> > >>> interface, I
> > >>>>>>>>>>>>>>> couldn’t really
> > >>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
> > >>> configuration
> > >>>>>>>>>>>>>>> class and just
> > >>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda
> breaking
> > >>> the
> > >>>>>>>>>>>>>>> semantics.
> > >>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
> > >>> encapsulates topic
> > >>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal
> topics is
> > >>> viable
> > >>>>>>>>>>>>>>> thing to do.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
> > >>> bbejeck@gmail.com
> > >>>>>>>>>>>>> <ma...@gmail.com>
> > >>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
> > >>> <mailto:
> > >>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> > >>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.
> In
> > >>> addition to
> > >>>>>>>>>>>>>>> the reason
> > >>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
> > >>> hash-based
> > >>>>>>>>>>>>>>> partitioning
> > >>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
> > >>> different than the
> > >>>>>>>>>>>>>>> key
> > >>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
> > >>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
> > >>> conjunction with
> > >>>>>>>>>>>>>>> the key
> > >>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
> > >>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
> > >>> Produced hence
> > >>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes
> sense
> > >>> for API
> > >>>>>>>>>>>>>>> consistency.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>> Bill
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani
> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > >>> levani.codes@gmail.com>
> > >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>>
> > >>> <mailto:
> > >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
> > >>> Repartitioned class,
> > >>>>>>>>>>>>>>> with the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of
> partitions for
> > >>> internal
> > >>>>>>>>>>>>>>> topics, user
> > >>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure
> co-partitioning
> > >>> before join
> > >>>>>>>>>>>>>>> operation.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> > >>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
> > >>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
> > >>> <mailto:
> > >>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> > >>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind.
> Again,
> > >>> not totally
> > >>>>>>>>>>>>>>> sure it
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar
> is
> > >>> the
> > >>>>>>>>>>>>> rationale
> > >>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani
> Kokhreidze
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> > >>> levani.codes@gmail.com>
> > >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>>
> > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your
> example
> > >>> I would
> > >>>>>>>>>>>>>>> first issue
> > >>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper
> partitioner
> > >>> that
> > >>>>>>>>>>>>>>> essentially
> > >>>>>>>>>>>>>>>>>>>>>>>> would
> > >>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with
> and
> > >>> then do the
> > >>>>>>>>>>>>>>>>>>>>>>>> KStream#join
> > >>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler
> <
> > >>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> > >>> john@confluent.io
> > >>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io
> <mailto:
> > >>>>>>>>>>>>> john@confluent.io>
> > >>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an
> option to
> > >>> specify
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is
> that
> > >>> some data may
> > >>>>>>>>>>>>>>> get
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
> > >>> topic. If the
> > >>>>>>>>>>>>>>> right-side
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
> > >>> strategy, then the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
> > >>> partitioned with the
> > >>>>>>>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
> > >>> something
> > >>>>>>>>>>>>>>> important?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
> > >>> Kokhreidze
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> > >>> levani.codes@gmail.com>
> > >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>>
> > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To
> be
> > >>> honest I’m
> > >>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>> sure
> > >>>>>>>>>>>>>>>>>>>>>>>> about
> > >>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
> > >>> think I would
> > >>>>>>>>>>>>>>> need control
> > >>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka
> Streams
> > >>> knows best
> > >>>>>>>>>>>>>>> how to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should
> be
> > >>> used only for
> > >>>>>>>>>>>>>>> topics
> > >>>>>>>>>>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
> > >>> possibility to
> > >>>>>>>>>>>>>>> specify
> > >>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet,
> but I
> > >>> guess
> > >>>>>>>>>>>>>>> specifying the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
> > >>> agreement on
> > >>>>>>>>>>>>> this.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
> > >>> Blee-Goldman <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
> sophie@confluent.io>
> > >>> <mailto:
> > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
> > >>> Repartitioned
> > >>>>>>>>>>>>>>> would be a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> useful
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also
> need
> > >>> to have
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field,
> similar
> > >>> to
> > >>>>>>>>>>>>>>> Produced? I'm not
> > >>>>>>>>>>>>>>>>>>>>>>>>>> sure how
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but
> seems
> > >>> it should
> > >>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>>>>>>> available
> > >>>>>>>>>>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
> > >>> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > >>> levani.codes@gmail.com>
> > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by
> Kafka
> > >>> Streams.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user
> more
> > >>> control over
> > >>>>>>>>>>>>>>> the topic
> > >>>>>>>>>>>>>>>>>>>>>>>>>> such as
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
> > >>> something that
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>> missing
> > >>>>>>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
> > >>> parallelism by
> > >>>>>>>>>>>>>>> configuring
> > >>>>>>>>>>>>>>>>>>>>>>>> num
> > >>>>>>>>>>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
> > >>> Blee-Goldman <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <mailto:
> sophie@confluent.io>
> > >>> <mailto:
> > >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> > >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
> > >>> thing for me
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>> for the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
> > >>> Repartitioned,
> > >>>>>>>>>>>>>>> will the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> topic be
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems
> to be
> > >>> the case
> > >>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> signature
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it
> have to
> > >>> be
> > >>>>>>>>>>>>>>> pre-created? The
> > >>>>>>>>>>>>>>>>>>>>>>>>>> wording
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version
> of
> > >>> the method
> > >>>>>>>>>>>>>>> will
> > >>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
> > >>> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > >>> levani.codes@gmail.com>
> > >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> > >>> <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing
> list :)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
> > >>> opinions/concerns
> > >>>>>>>>>>>>> about
> > >>>>>>>>>>>>>>> this KIP.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
> > >>> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
> > >>> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward,
> I’ve
> > >>> updated
> > >>>>>>>>>>>>>>> confluence
> > >>>>>>>>>>>>>>>>>>>>>>>> page
> > >>>>>>>>>>>>>>>>>>>>>>>>>> with
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected
> Alternatives”
> > >>> section.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP
> :)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
> > >>> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>>
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
> > >>> dedicated `Topic`
> > >>>>>>>>>>>>>>> class for
> > >>>>>>>>>>>>>>>>>>>>>>>>>> topic
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators
> like
> > >>>>>>>>>>>>> `groupedBy`.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others
> opinion
> > >>> about this
> > >>>>>>>>>>>>> as
> > >>>>>>>>>>>>>>> well.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM,
> Matthias
> > >>> J. Sax <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>>
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP!
> And
> > >>> thanks for
> > >>>>>>>>>>>>>>> summarizing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
> > >>> discussed
> > >>>>>>>>>>>>>>> already (can't
> > >>>>>>>>>>>>>>>>>>>>>>>>>> really
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a
> good
> > >>> summary to
> > >>>>>>>>>>>>>>> refresh the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes
> sense.
> > >>> With regard
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> distinction
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage
> topics
> > >>> and
> > >>>>>>>>>>>>> operators
> > >>>>>>>>>>>>>>> that use
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument,
> it
> > >>> might
> > >>>>>>>>>>>>> indicate
> > >>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>>>>>>>>>> leaving
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator
> that
> > >>> uses
> > >>>>>>>>>>>>>>> use-defined
> > >>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
> > >>> operator (an
> > >>>>>>>>>>>>>>> operator that
> > >>>>>>>>>>>>>>>>>>>>>>>>>> manages
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
> > >>> Otherwise, there is
> > >>>>>>>>>>>>>>> one
> > >>>>>>>>>>>>>>>>>>>>>>>> operator
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
> > >>> topics but
> > >>>>>>>>>>>>>>> sometimes
> > >>>>>>>>>>>>>>>>>>>>>>>> not; a
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make
> the
> > >>> distinction
> > >>>>>>>>>>>>>>> clearer.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
> > >>> `Grouped`. I am
> > >>>>>>>>>>>>>>> wondering
> > >>>>>>>>>>>>>>>>>>>>>>>>>> if the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does
> apply
> > >>> and adding
> > >>>>>>>>>>>>>>> it is
> > >>>>>>>>>>>>>>>>>>>>>>>>>> semantically
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
> > >>> opinions of
> > >>>>>>>>>>>>>>> others on
> > >>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I
> prefer
> > >>> atm.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration
> objects
> > >>> that allow
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>> configure
> > >>>>>>>>>>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer,
> producer,
> > >>> store. If we
> > >>>>>>>>>>>>>>> assume that
> > >>>>>>>>>>>>>>>>>>>>>>>>>> a topic
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if
> we
> > >>> should have a
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()`
> class
> > >>> and method
> > >>>>>>>>>>>>>>> instead of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> a plain
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to
> add
> > >>> other
> > >>>>>>>>>>>>> configs,
> > >>>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>>>>>>>>> replication
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc,
> easily,
> > >>> without the
> > >>>>>>>>>>>>>>> need to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> change the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not
> sure
> > >>> if I like
> > >>>>>>>>>>>>>>> them
> > >>>>>>>>>>>>>>>>>>>>>>>> myself.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> :)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani
> Kokhreidze
> > >>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
> > >>> maybe
> > >>>>>>>>>>>>> enhancing
> > >>>>>>>>>>>>>>> Produced
> > >>>>>>>>>>>>>>>>>>>>>>>>>> with num
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
> > >>> best approach.
> > >>>>>>>>>>>>>>> Let me
> > >>>>>>>>>>>>>>>>>>>>>>>>>> explain
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class
> with
> > >>> this
> > >>>>>>>>>>>>>>> configuration,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> this will
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation.
> Since
> > >>> KStream#to is
> > >>>>>>>>>>>>>>> the final
> > >>>>>>>>>>>>>>>>>>>>>>>>>> sink of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be
> reasonable
> > >>>>>>>>>>>>> assumption
> > >>>>>>>>>>>>>>> that user
> > >>>>>>>>>>>>>>>>>>>>>>>>>> needs
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in
> advance. And
> > >>> in that
> > >>>>>>>>>>>>>>> case, having
> > >>>>>>>>>>>>>>>>>>>>>>>>>> num of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make
> much
> > >>> sense.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class,
> based
> > >>> on API
> > >>>>>>>>>>>>>>> contract, seems
> > >>>>>>>>>>>>>>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something
> that
> > >>> is
> > >>>>>>>>>>>>>>> explicitly for
> > >>>>>>>>>>>>>>>>>>>>>>>>>> producer
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer,
> partitioner
> > >>> those all
> > >>>>>>>>>>>>>>> are producer
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions
> is
> > >>> topic level
> > >>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer
> level
> > >>>>>>>>>>>>>>> configurations
> > >>>>>>>>>>>>>>>>>>>>>>>>>> together in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
> > >>> seems like
> > >>>>>>>>>>>>>>> Produced
> > >>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with
> non-internal
> > >>> (e.g
> > >>>>>>>>>>>>> topics
> > >>>>>>>>>>>>>>> created
> > >>>>>>>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics
> and I
> > >>> think we
> > >>>>>>>>>>>>>>> should leave
> > >>>>>>>>>>>>>>>>>>>>>>>>>> it as
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into
> account,
> > >>> I think we
> > >>>>>>>>>>>>>>> should
> > >>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
> > >>> Streams should
> > >>>>>>>>>>>>>>> create and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> manage
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
> > >>> topics that
> > >>>>>>>>>>>>>>> should be
> > >>>>>>>>>>>>>>>>>>>>>>>>>> created in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
> > >>> numPartitions
> > >>>>>>>>>>>>>>> configuration in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and
> producer
> > >>> level
> > >>>>>>>>>>>>>>> configuration in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> one
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
> > >>> semantics.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name
> optional
> > >>> in
> > >>>>>>>>>>>>>>> KStream#through - I
> > >>>>>>>>>>>>>>>>>>>>>>>>>> think
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and
> giving
> > >>> users
> > >>>>>>>>>>>>>>> possibility to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> specify
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more
> useful :)
> > >>>>>>>>>>>>> Considering
> > >>>>>>>>>>>>>>> arguments
> > >>>>>>>>>>>>>>>>>>>>>>>>>> against
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
> > >>> class, I see two
> > >>>>>>>>>>>>>>> options
> > >>>>>>>>>>>>>>>>>>>>>>>>>> here:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> > >>> auto-generated and
> > >>>>>>>>>>>>>>> num of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> numOfPartitions)
> > >>> - topic
> > >>>>>>>>>>>>> will
> > >>>>>>>>>>>>>>> be auto
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
> partitions
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> numOfPartitions,
> > >>> final
> > >>>>>>>>>>>>>>> Produced<K, V>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
> generated
> > >>> with
> > >>>>>>>>>>>>>>> specified num of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> > >>> parameter.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it
> is and
> > >>> introduce
> > >>>>>>>>>>>>>>> new method
> > >>>>>>>>>>>>>>>>>>>>>>>> -
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
> > >>> suggested this
> > >>>>>>>>>>>>>>> in one of
> > >>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
> > >>> propose the
> > >>>>>>>>>>>>>>> following
> > >>>>>>>>>>>>>>>>>>>>>>>> plan:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> > >>> configuration to
> > >>>>>>>>>>>>> Grouped
> > >>>>>>>>>>>>>>> class (as
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method
> overloads to
> > >>>>>>>>>>>>>>> KStream#through
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> > >>> auto-generated and
> > >>>>>>>>>>>>>>> num of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> numOfPartitions)
> > >>> - topic
> > >>>>>>>>>>>>> will
> > >>>>>>>>>>>>>>> be auto
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of
> partitions
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int
> numOfPartitions,
> > >>> final
> > >>>>>>>>>>>>>>> Produced<K, V>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with
> generated
> > >>> with
> > >>>>>>>>>>>>>>> specified num of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> > >>> parameter.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> > >>> configuration to
> > >>>>>>>>>>>>> Grouped
> > >>>>>>>>>>>>>>> class (as
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
> > >>> KStream#repartition for
> > >>>>>>>>>>>>>>> creating and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> managing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
> > >>> already
> > >>>>>>>>>>>>>>> discussed in the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> mailing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the
> threads
> > >>> that were
> > >>>>>>>>>>>>>>> about this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM,
> Levani
> > >>> Kokhreidze <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> > >>>>>>>>>>>>> levani.codes@gmail.com>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect
> discussion
> > >>> around
> > >>>>>>>>>>>>>>> KIP-221. Going
> > >>>>>>>>>>>>>>>>>>>>>>>>>> through
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems
> to
> > >>> agreement
> > >>>>>>>>>>>>>>> around
> > >>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation,
> as far
> > >>> as I
> > >>>>>>>>>>>>>>> understood, the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> most
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
> > >>> following:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
> > >>> KStream#through
> > >>>>>>>>>>>>>>> method
> > >>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
> > >>> numOfPartitions
> > >>>>>>>>>>>>>>>>>>>>>>>> configuration
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
> > >>> users to
> > >>>>>>>>>>>>> control
> > >>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing
> stateful
> > >>>>>>>>>>>>> operations.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
> > >>> changes around
> > >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by John Roesler <jo...@confluent.io>.
Hi all,

I think that Sophie is asking a good question, and I do think that
such "blanket configurations" are plausible. For example, we currently
support (and I would encourage) "I don't know if this is going to
create a repartition topic, but if it does, then use this name instead
of generating one".

I'm not sure I'm convinced that specifying max parallelism falls into
this category. After all, the groupByKey+aggregate will be executed
with _some_ max parallelism. It's either the same as the inputs'
partition count or overridden with the proposed config. It seems
counterintuitive to override the specified option with the default
value.

I'm not sure if I can put my finger on it, but "maybe use this name"
seems way more reasonable to me than "maybe execute with this degree
of parallelism".

I do think (and I appreciate that this is where Sophie's example is
coming from) that Streams should strive to be absolutely as simple and
intuitive as possible (while still maintaining correctness). Optimal
performance can be at odds with API simplicity. For example, the
simplest behavior is, if you ask for 5 partitions, you get 5
partitions. Maybe a repartition is technically not necessary (if you
didn't change the key), but at least there's no mystery to this
behavior.

Clearly, an (opposing) tenent of simplicity is trying to prevent
people from making mistakes, which I think is what the example boils
down to. Sometimes, we can prevent clear mistakes, like equi-joining
two topics with different partition counts. But for this case, it
doesn't seem as clear-cut to be able to assume that they _said_ 5
partitions, but they didn't really _want_ 5 partitions. Maybe we can
just try to be clear in the documentation, and also even log a warning
when we parse the topology, "hey, I've been asked to repartition this
stream, but it's not necessary".

If anything, this discussion really supports to me the value in just
sticking with `repartition()` for now, and deferring
`groupBy[Key](partitions)` to the future.

> Users should not have to choose between allowing Streams to optimize the repartition placement, and allowing to specify a number of partitions.

This is a very fair point, and it may be something that we rapidly
return to, but it seems safe for now to introduce the non-optimizable
`reparition()` only, and then consider optimization options later.
Skipping available optimizations will never break correctness, but
adding optimizations can, so it makes sense to treat them with
caution.

In conclusion, I do think that a use _could_ want to "maybe specify"
the partition count, but I also think we can afford to pass on
supporting this right now.

I'm open to continuing the discussion, but just to avoid ambiguity, I
still feel we should _not_ change the groupBy[Key] operation at all,
and we should only add `repartition()` as a non-optimizable operation.

Thanks all,
-John

On Fri, Nov 15, 2019 at 11:26 AM Levani Kokhreidze
<le...@gmail.com> wrote:
>
> Hello,
>
> Just fyi, PR was updated and now it incorporates the latest suggestions about joins.
> `CopartitionedTopicsEnforcer` will throw an exception if number of partitions aren’t the same when using `repartition` operation along with `join`.
>
> For more details please take a look at the PR: https://github.com/apache/kafka/pull/7170/files <https://github.com/apache/kafka/pull/7170/files>
>
> Regards,
> Levani
>
>
> > On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io> wrote:
> >
> > Thanks a lot for the input Sophie.
> >
> > Your example is quite useful, and I would use it to support my claim
> > that a "partition hint" for `Grouped` seems "useless" and does not
> > improve the user experience.
> >
> > 1) You argue that a new user would be worries about repartitions topics
> > with too many paritions. This would imply that a user is already
> > advanced enough to understand the implication of repartitioning -- for
> > this case, I would argue that a user also understand _when_ a
> > auto-repartitioning would happen and thus the users understands where to
> > insert a `repartition()` operation.
> >
> > 2) For specifying Serdes: if a `groupByKey()` does not trigger
> > auto-repartitioning it's not required to specify the serdes and if they
> > are specified they would be ignored/unused (note, that `groupBy()` would
> > always trigger a repartitioning). Of course, if the default Serdes from
> > the config match (eg, all data types are Json anyway), a user does not
> > need to worry about specifying serdes. -- For new user that play around,
> > I would assume that they work a lot with primitive types and thus would
> > need to specify the serdes -- hence, they would learn about
> > auto-repartitioning the hard way anyhow, because each time a
> > `groupByKey()` does trigger auto-repartioning, they would need to pass
> > in the correct Serdes -- this way, they would also be educated where to
> > insert a `repartition()` operator if needed.
> >
> > 3) If a new user really just "plays around", I don't think they use an
> > input topic with 100 partitions but most likely have a local single node
> > broker with most likely single partitions topics.
> >
> >
> > My main argument for my current proposal is however, that---based on
> > past experience---it's better to roll out a new feature more carefully
> > and see how it goes. Last, as John pointed out, we can still extend the
> > feature in the future. Instead of making a judgment call up-front, being
> > more conservative and less fancy, and revisit the design based on
> > actuall user feedback after the first version is rolled out, seems to be
> > the better option. Undoing a feature is must harder than extending it.
> >
> >
> > While I advocate strong for a simple first version of this feature, it's
> > a community decission in the end, and I would not block this KIP if
> > there is a broad preference to add `Grouped#withNumberOfPartitions()`
> > either.
> >
> >
> > -Matthias
> >
> > On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
> >> It seems like we all agree at this point (please correct me if wrong!) that
> >> we should NOT change
> >> the existing repartitioning behavior, ie we should allow Streams to
> >> continue to determine when and
> >> where to repartition -- *unless* explicitly informed to by the use of a
> >> .through or the new .repartition operator.
> >>
> >> Regarding groupBy, the existing behavior we should not disrupt is
> >> a) repartition *only* when required due to upstream key-changing operation
> >> (ie don't force repartitioning
> >> based on the presence of an optional config parameter), and
> >> b) allow optimization of required repartitions, if any
> >>
> >> Within the constraint of not breaking the existing behavior, this still
> >> leaves open the question of whether we
> >> want to improve the user experience by allowing to provide groupBy with a
> >> *suggestion* for numPartitions (or to
> >> put it more fairly, whether that *will* improve the experience). I agree
> >> with many of the arguments outlined above but
> >> let me just push back on this one issue one final time, and if we can't
> >> come to a consensus then I am happy to drop
> >> it for now so that the KIP can proceed.
> >>
> >> Specifically, my proposal would be to simply augment Grouped with an
> >> optional numPartitions, understood to
> >> indicate the user's desired number of partitions *if Streams decides to
> >> repartition due to that groupBy*
> >>
> >>> if a user cares about the number of partition, the user wants to enforce
> >> a repartitioning
> >> First, I think we should take a step back and examine this claim. I agree
> >> 100% that *if this is true,*
> >> *then we should not give groupBy an optional numPartitions.* As far as I
> >> see it, there's no argument
> >> to be had there if we *presuppose that claim.* But I'm not convinced in
> >> that as an axiom of the user
> >> experience and think we should be examining that claim itself, not the
> >> consequences of it.
> >>
> >> To give a simple example, let's say some new user is trying out Streams and
> >> wants to just play around
> >> with it to see if it might be worth looking into. They want to just write
> >> up a simple app and test it out on the
> >> data in some existing topics they have with a large number of partitions,
> >> and a lot of data. They're just messing
> >> around, trying new topologies and don't want to go through each new one
> >> step by step to determine if (or where)
> >> a repartition might be required. They also don't want to force a
> >> repartition if it turns out to not be required, so they'd
> >> like to avoid the nice new .repartition operator they saw. But given the
> >> huge number of input partitions, they'd like
> >> to rest assured that if a repartition does end up being required somewhere
> >> during dev, it will not be created with
> >> the same huge number of partitions that their input topic has -- so they
> >> just pass groupBy a small numPartitions
> >> suggestion.
> >>
> >> I know that's a bit of a contrived example but I think it does highlight
> >> how and when this might be a considerable
> >> quality of life improvement, in particular for new users to Streams and/or
> >> during the dev cycle. *You don't want to*
> >> *force a repartition if it wasn't necessary, but you don't want to create a
> >> topic with a huge partition count either.*
> >>
> >> Also, while the optimization discussion took us down an interesting but
> >> ultimately more distracting road, it's worth
> >> pointing out that it is clearly a major win to have as few
> >> repartition topics/steps as possible. Given that we
> >> don't want to change existing behavior, the optimization framework can only
> >> help out when the placement of
> >> repartition steps is flexible, which means only those from .groupBy (and
> >> not .repartition). *Users should not*
> >> *have to choose between allowing Streams to optimize the repartition
> >> placement, and allowing to specify a *
> >> *number of partitions.*
> >>
> >> Lastly, I have what may be a stupid question but for my own edification of
> >> how groupBy works:
> >> if you do a .groupBy and a repartition is NOT required, does it ever need
> >> to serialize/deserialize
> >> any of the data? In other words, if you pass a key/value serde to groupBy
> >> and it doesn't trigger
> >> a repartition, is the serde(s) just ignored and thus more like a suggestion
> >> than a requirement?
> >>
> >> So again, I don't want to hold up this KIP forever but I feel we've spent
> >> some time getting slightly
> >> off track (although certainly into very interesting discussions) yet never
> >> really addressed or questioned
> >> the basic premise: *could a user want to specify a number of partitions but
> >> not enforce a repartition (at that*
> >> *specific point in the topology)?*
> >>
> >>
> >>
> >> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <ma...@confluent.io>
> >> wrote:
> >>
> >>> Side remark:
> >>>
> >>> If the user specifies `repartition()` on both side of the join, we can
> >>> actually throw the execption earlier, ie, when we build the topology.
> >>>
> >>> Current, we can do this check only after Kafka Streams was started,
> >>> within `StreamPartitionAssignor#assign()` -- we still need to keep this
> >>> check for the case that none or only one side has a user specified
> >>> number of partitions though.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 11/14/19 8:15 AM, John Roesler wrote:
> >>>> Thanks, all,
> >>>>
> >>>> I can get behind just totally leaving out reparation-via-groupBy. If
> >>>> we only introduce `repartition()` for now, we're making the minimal
> >>>> change to gain the desired capability.
> >>>>
> >>>> Plus, since we agree that `repartition()` should never be optimizable,
> >>>> it's a future-compatible proposal. I.e., if we were to add a
> >>>> non-optimizable groupBy(partitions) operation now, and want to make it
> >>>> optimizable in the future, we have to worry about topology
> >>>> compatibility. Better to just do non-optimizable `repartition()` now,
> >>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
> >>>>
> >>>> About joins, yes, it's a concern, and IMO we should just do the same
> >>>> thing we do now... check at runtime that the partition counts on both
> >>>> sides match and throw an exception otherwise. What this means as a
> >>>> user is that if you explicitly repartition the left side to 100
> >>>> partitions, and then join with the right side at 10 partitions, you
> >>>> get an exception, since this operation is not possible. You'd either
> >>>> have to "step down" the left side again, back to 10 partitions, or you
> >>>> could repartition the right side to 100 partitions before the join.
> >>>> The choice has to be the user's, since it depends on their desired
> >>>> execution parallelism.
> >>>>
> >>>> Thanks,
> >>>> -John
> >>>>
> >>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <ma...@confluent.io>
> >>> wrote:
> >>>>>
> >>>>> Thanks a lot John. I think the way you decompose the operators is super
> >>>>> helpful for this discussion.
> >>>>>
> >>>>> What you suggest with regard to using `Grouped` and enforcing
> >>>>> repartitioning if the number of partitions is specified is certainly
> >>>>> possible. However, I am not sure if we _should_ do this. My reasoning is
> >>>>> that an enforce repartitioning as introduced via `repartition()` is an
> >>>>> expensive operations, and it seems better to demand an more explicit
> >>>>> user opt-in to trigger it. Just setting an optional parameter might be
> >>>>> too subtle to trigger such a heavy "side effect".
> >>>>>
> >>>>> While I agree about "usability" in general, I would prefer a more
> >>>>> conservative appraoch to introduce this feature, see how it goes, and
> >>>>> maybe make it more advance later on. This also applies to what
> >>>>> optimzation we may or may not allow (or are able to perform at all).
> >>>>>
> >>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
> >>>>> Grouped`, I agree that it might not be a good idea.
> >>>>>
> >>>>> Atm, I see an enforces repartitioning as non-optimizable and as a good
> >>>>> first step and I would suggest to not intoruce anything else for now.
> >>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
> >>>>> Grouped)` is something we could add later.
> >>>>>
> >>>>>
> >>>>> Therefore, I would not change `Grouped` but only introduce
> >>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in to
> >>>>> set the number of partitions, would need to rewrite their code to
> >>>>> `selectKey(...).repartition(...).groupByKey()`. It's less convinient but
> >>>>> also less risky from an API and optimization point of view.
> >>>>>
> >>>>>
> >>>>> @Levani: about joins -> yes, we will need to check the specified number
> >>>>> of partitions (if any) and if they don't match, throw an exception. We
> >>>>> can discuss this on the PR -- I am just trying to get the PR for KIP-466
> >>>>> merged -- your is next on the list :)
> >>>>>
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>>
> >>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
> >>>>>> Thank you all for an interesting discussion. This is very enlightening.
> >>>>>>
> >>>>>> Thank you Matthias for your explanation. Your arguments are very true.
> >>> It makes sense that if user specifies number of partitions he/she really
> >>> cares that those specifications are applied to internal topics.
> >>>>>> Unfortunately, in current implementation this is not true during
> >>> `join` operation. As I’ve written in the PR comment, currently, when
> >>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of
> >>> partitions from the two source topics.
> >>>>>> I’m not really sure what would be the other way around this situation.
> >>> Maybe fail the stream altogether and inform the user to specify same number
> >>> of partitions?
> >>>>>> Or we should treat join operations in a same way as it is right now
> >>> and basically choose max number of partitions even when `repartition`
> >>> operation is specified, because Kafka Streams “knows the best” how to
> >>> handle joins?
> >>>>>> You can check integration tests how it’s being handled currently. Open
> >>> to suggestions on that part.
> >>>>>>
> >>>>>> As for groupBy, I agree and John raised very interesting points. My
> >>> arguments for allowing users to specify number of partitions during groupBy
> >>> operations mainly was coming from the usability perspective.
> >>>>>> So building on top of what John said, maybe it makes sense to make
> >>> `groupBy` operations smarter and whenever user specifies
> >>> `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
> >>>>>> I’m not going into optimization part yet :) I think it will be part of
> >>> separate PR and task, but overall it makes sense to apply optimizations
> >>> where number of partitions are the same.
> >>>>>>
> >>>>>> As for Repartitioned extending Grouped, I kinda feel that it won’t fit
> >>> nicely in current API design.
> >>>>>> In addition, in the PR review, John mentioned that there were a lot of
> >>> troubles in the past trying to use one operation's configuration objects on
> >>> other operations.
> >>>>>> Also it makes sense to keep them separate in terms of compatibility.
> >>>>>> In that case, we don’t have to worry every time Grouped is changed,
> >>> what would be the implications on `repartition` operations.
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Levani
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
> >>>>>>>
> >>>>>>> Ah, thanks for the clarification. I missed your point.
> >>>>>>>
> >>>>>>> I like the framework you've presented. It does seem simpler to assume
> >>>>>>> that they either care about the partition count and want to
> >>>>>>> repartition to realize it, or they don't care about the number.
> >>>>>>> Returning to this discussion, it does seem unlikely that they care
> >>>>>>> about the number and _don't_ care if it actually gets realized.
> >>>>>>>
> >>>>>>> But then, it still seems like we can just keep the option as part of
> >>>>>>> Grouped. As in:
> >>>>>>>
> >>>>>>> // user does not care
> >>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
> >>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
> >>>>>>>
> >>>>>>> // user does care
> >>>>>>> stream.repartition(Repartitioned)
> >>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
> >>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
> >>>>>>>
> >>>>>>> ----
> >>>>>>>
> >>>>>>> The above discussion got me thinking about algebra. Matthias is
> >>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent to
> >>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
> >>>>>>> we should force people to apply that expansion themselves vs. having a
> >>>>>>> more compact way to express it if they don't care where exactly the
> >>>>>>> repartition occurs. However, thinking about these operators
> >>>>>>> algebraically can really help *us* narrow down the number of different
> >>>>>>> expressions we have to consider.
> >>>>>>>
> >>>>>>> Let's consider some identities:
> >>>>>>>
> >>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
> >>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
> >>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
> >>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
> >>>>>>> D: repartition = sink(managed) + src
> >>>>>>>
> >>>>>>> In these identities, I used one special identifier (...), which means
> >>>>>>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
> >>>>>>> repartition, or agg.
> >>>>>>>
> >>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
> >>>>>>> operations are not executable. In other words, you have to get to a
> >>>>>>> point where you can apply B to convert a groupByKey into a passthough
> >>>>>>> in order to execute the program. This is just a formal way of stating
> >>>>>>> what already happens in Kafka Streams.
> >>>>>>>
> >>>>>>> By applying A, we can just completely leave `groupBy` out of our
> >>>>>>> analysis. It trivially decomposes into a mapKey followed by a
> >>>>>>> groupByKey.
> >>>>>>>
> >>>>>>> Then, we can eliminate the "repartition required" case of `groupByKey`
> >>>>>>> by applying C followed by D to get to the "no repartition required"
> >>>>>>> version of groupByKey, which in turn sets us up to apply B to get an
> >>>>>>> executable topology.
> >>>>>>>
> >>>>>>> Fundamentally, you can think about KIP-221 is as proposing a modified
> >>>>>>> D identity in which you can specify the partition count of the managed
> >>>>>>> sink topic:
> >>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
> >>>>>>>
> >>>>>>> Since users _could_ apply the identities above, we don't actually have
> >>>>>>> to add any partition count to groupBy[Key], but we decided early on in
> >>>>>>> the KIP discussion that it's more ergonomic to add it. In that case,
> >>>>>>> we also have to modify A and C:
> >>>>>>> A': groupBy(mapper, pc) + agg
> >>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
> >>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
> >>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
> >>>>>>>
> >>>>>>> Which sets us up still to always be able to get back to a plain
> >>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
> >>>>>>> ultimately B to get an executable topology.
> >>>>>>>
> >>>>>>> What about the optimizer?
> >>>>>>> The optimizer applies another set of graph-algebraic identities to
> >>>>>>> minimize the number of repartition topics in a topology.
> >>>>>>>
> >>>>>>> (forgive my ascii art)
> >>>>>>>
> >>>>>>> E: (merging repartition nodes)
> >>>>>>> (...) -> repartition -> X
> >>>>>>> \-> repartition -> Y
> >>>>>>> =
> >>>>>>> (... + repartition) -> X
> >>>>>>>    \-> Y
> >>>>>>> F: (reordering around repartition)
> >>>>>>> Where SVO is any non-key-changing, stateless, operation:
> >>>>>>> repartition -> SVO = SVO -> repartition
> >>>>>>>
> >>>>>>> In terms of these identities, what the optimizer does is apply F
> >>>>>>> repeatedly in either direction to a topology to factor out common in
> >>>>>>> branches so that it can apply E to merge repartition nodes. This was
> >>>>>>> especially necessary before KIP-221 because you couldn't directly
> >>>>>>> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
> >>>>>>> so there was no way to do the factoring manually.
> >>>>>>>
> >>>>>>> We can now state very clearly that in KIP-221, explicit
> >>>>>>> `repartition()` operators should create a "reordering barrier". So, F
> >>>>>>> cannot be applied to an explicit `repartition()`. Also, I think we
> >>>>>>> decided earlier that explicit `repartition()` operations would also be
> >>>>>>> ineligible for merging, so E can't be applied to explicit
> >>>>>>> `repartition()` operations either. I think we feel we _could_ apply E
> >>>>>>> without harm, but we want to be conservative for now.
> >>>>>>>
> >>>>>>> I think the salient point from the latter discussion has been that
> >>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
> >>>>>>> an explicit `repartition()` operator, and therefore, the resulting
> >>>>>>> repartition node remains eligible for optimization.
> >>>>>>>
> >>>>>>> To be clear, I agree with Matthias that the provided partition count
> >>>>>>> _must_ be used in the resulting implicit repartition. This has some
> >>>>>>> implications for E. Namely, E could only be applied to two repartition
> >>>>>>> nodes that have the same partition count. This has always been
> >>>>>>> trivially true before KIP-221 because the partition count has always
> >>>>>>> been "unspecified", i.e., it would be determined at runtime by the
> >>>>>>> user-managed-topics' partition counts. Now, it could be specified or
> >>>>>>> unspecified. We can simply augment E to allow merging only repartition
> >>>>>>> nodes where the partition count is EITHER "specified and the same on
> >>>>>>> both sides", OR "unspecified on both sides".
> >>>>>>>
> >>>>>>> Sorry for the long email, but I have a hope that it builds a solid
> >>>>>>> theoretical foundation for our decisions in KIP-221, so we can have
> >>>>>>> confidence that there are no edge cases for design flaws to hide.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
> >>> matthias@confluent.io> wrote:
> >>>>>>>>
> >>>>>>>>> it seems like we do want to allow
> >>>>>>>>>> people to optionally specify a partition count as part of this
> >>>>>>>>>> operation, but we don't want that option to _force_ repartitioning
> >>>>>>>>
> >>>>>>>> Correct, ie, that is my suggestions.
> >>>>>>>>
> >>>>>>>>> "Use P partitions if repartitioning is necessary"
> >>>>>>>>
> >>>>>>>> I disagree here, because my reasoning is that:
> >>>>>>>>
> >>>>>>>> - if a user cares about the number of partition, the user wants to
> >>>>>>>> enforce a repartitioning
> >>>>>>>> - if a user does not case about the number of partitions, we don't
> >>> need
> >>>>>>>> to provide them a way to pass in a "hint"
> >>>>>>>>
> >>>>>>>> Hence, it should be sufficient to support:
> >>>>>>>>
> >>>>>>>> // user does not care
> >>>>>>>>
> >>>>>>>> `stream.groupByKey(Grouped)`
> >>>>>>>> `stream.grouBy(..., Grouped)`
> >>>>>>>>
> >>>>>>>> // user does care
> >>>>>>>>
> >>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
> >>>>>>>> `streams.groupBy(..., Repartitioned)`
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
> >>>>>>>>> Thanks for those thoughts, Matthias,
> >>>>>>>>>
> >>>>>>>>> I find your reasoning about the optimization behavior compelling.
> >>> The
> >>>>>>>>> `through` operation is very simple and clear to reason about. It
> >>> just
> >>>>>>>>> passes the data exactly at the specified point in the topology
> >>> exactly
> >>>>>>>>> through the specified topic. Likewise, if a user invokes a
> >>>>>>>>> `repartition` operator, the simplest behavior is if we just do what
> >>>>>>>>> they asked for.
> >>>>>>>>>
> >>>>>>>>> Stepping back to think about when optimizations are surprising and
> >>>>>>>>> when they aren't, it occurs to me that we should be free to move
> >>>>>>>>> around repartitions when users have asked to perform some operation
> >>>>>>>>> that implies a repartition, like "change keys, then filter, then
> >>>>>>>>> aggregate". This program requires a repartition, but it could be
> >>>>>>>>> anywhere between the key change and the aggregation. On the other
> >>>>>>>>> hand, if they say, "change keys, then filter, then repartition, then
> >>>>>>>>> aggregate", it seems like they were pretty clear about their desire,
> >>>>>>>>> and we should just take it at face value.
> >>>>>>>>>
> >>>>>>>>> So, I'm sold on just literally doing a repartition every time they
> >>>>>>>>> invoke the `repartition` operator.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is more
> >>> nuanced.
> >>>>>>>>>
> >>>>>>>>> What you said about `groupByKey` makes sense. If they key hasn't
> >>>>>>>>> actually changed, then we don't need to repartition before
> >>>>>>>>> aggregating. On the other hand, `groupBy` is specifically changing
> >>> the
> >>>>>>>>> key as part of the grouping operation, so (as you said) we
> >>> definitely
> >>>>>>>>> have to do a repartition.
> >>>>>>>>>
> >>>>>>>>> If I'm reading the discussion right, it seems like we do want to
> >>> allow
> >>>>>>>>> people to optionally specify a partition count as part of this
> >>>>>>>>> operation, but we don't want that option to _force_ repartitioning
> >>> if
> >>>>>>>>> it's not needed. That last clause is the key. "Use P partitions if
> >>>>>>>>> repartitioning is necessary" is a directive that applies cleanly and
> >>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call the
> >>>>>>>>> option `numberOfPartitionsHint`, which along with the "if necessary"
> >>>>>>>>> javadoc, should make it clear that the option won't force a
> >>>>>>>>> repartition, and also gives us enough latitude to still employ the
> >>>>>>>>> optimizer on those repartition topics?
> >>>>>>>>>
> >>>>>>>>> If we like the idea of expressing it as a "hint" for grouping and a
> >>>>>>>>> "command" for `repartition`, then it seems like it still makes sense
> >>>>>>>>> to keep Grouped and Repartitioned separate, as they would actually
> >>>>>>>>> offer different methods with distinct semantics.
> >>>>>>>>>
> >>>>>>>>> WDYT?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> -John
> >>>>>>>>>
> >>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
> >>> matthias@confluent.io> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Sorry for late reply.
> >>>>>>>>>>
> >>>>>>>>>> I guess, the question boils down to the intended semantics of
> >>>>>>>>>> `repartition()`. My understanding is as follows:
> >>>>>>>>>>
> >>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using the
> >>>>>>>>>> upstream topic to determine the number of partitions)
> >>>>>>>>>> - KS does auto-repartitioning only for downstream DSL operators
> >>> like
> >>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
> >>> auto-repartitioning
> >>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
> >>>>>>>>>> - KS offers `through()` to enforce a repartitioning -- however,
> >>> the user
> >>>>>>>>>> needs to create the topic manually (with the desired number of
> >>> partitions).
> >>>>>>>>>>
> >>>>>>>>>> I see two main applications for `repartitioning()`:
> >>>>>>>>>>
> >>>>>>>>>> 1) repartition data before a `transform()` but user does not want
> >>> to
> >>>>>>>>>> manage the topic
> >>>>>>>>>> 2) scale out a downstream subtopology
> >>>>>>>>>>
> >>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a users
> >>> calls
> >>>>>>>>>> it, a repartitining is enforced, with the difference that KS
> >>> manages the
> >>>>>>>>>> topic and the user does not need to create it.
> >>>>>>>>>>
> >>>>>>>>>> This behavior makes (1) and (2) possible.
> >>>>>>>>>>
> >>>>>>>>>>> I think many users would prefer to just say "if there *is* a
> >>> repartition
> >>>>>>>>>>> required at this point in the topology, it should
> >>>>>>>>>>> have N partitions"
> >>>>>>>>>>
> >>>>>>>>>> Because of (2), I disagree. Either a user does not care about
> >>> scaling
> >>>>>>>>>> out, for which case she would not specify the number of
> >>> partitions. Or a
> >>>>>>>>>> user does care, and hence wants to enforce the scale out. I don't
> >>> think
> >>>>>>>>>> that any user would say, "maybe scale out".
> >>>>>>>>>>
> >>>>>>>>>> Therefore, the optimizer should never ignore the repartition
> >>> operation.
> >>>>>>>>>> As a "consequence" (because repartitioning is expensive) a user
> >>> should
> >>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
> >>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
> >>>>>>>>>> because it might be too subtle and an "optional scaling out" as
> >>> laid out
> >>>>>>>>>> above does not make sense IMHO.
> >>>>>>>>>>
> >>>>>>>>>> I am also not worried about "over repartitioning" because the
> >>> result
> >>>>>>>>>> stream would never trigger auto-repartitioning. Only if multiple
> >>>>>>>>>> consecutive calls to `repartition()` are made it could be bad --
> >>> but
> >>>>>>>>>> that's the same with `through()`. In the end, there is always some
> >>>>>>>>>> responsibility on the user.
> >>>>>>>>>>
> >>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
> >>>>>>>>>> however, for `groupByKey()` it depends if the KStream is marked as
> >>>>>>>>>> `repartitioningRequired`.
> >>>>>>>>>>
> >>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a user to
> >>> set
> >>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different story,
> >>>>>>>>>> because calling
> >>>>>>>>>>
> >>>>>>>>>>  `repartition().groupBy()`
> >>>>>>>>>>
> >>>>>>>>>> does not achieve what we want. Hence, allowing users to pass in the
> >>>>>>>>>> number of users partitions into `groupBy()` does actually makes
> >>> sense,
> >>>>>>>>>> because repartitioning will happen anyway and thus we can
> >>> piggyback a
> >>>>>>>>>> scaling decision.
> >>>>>>>>>>
> >>>>>>>>>> I think that John has a fair concern about the overloads, however,
> >>> I am
> >>>>>>>>>> not convinced that using `Grouped` to specify the number of
> >>> partitions
> >>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and
> >>> both
> >>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
> >>>>>>>>>> wondering if we could bridge the gap between both, if we would make
> >>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
> >>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what seems to
> >>> make
> >>>>>>>>>> the API quite smooth:
> >>>>>>>>>>
> >>>>>>>>>> `stream.groupBy(..., Grouped...)`
> >>>>>>>>>>
> >>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> -Matthias
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>> Hi Sophie,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
> >>> hearing others opinion as well on this.
> >>>>>>>>>>>
> >>>>>>>>>>> Kind regards,
> >>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
> >>> sophie@confluent.io> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> >>> other hand
> >>>>>>>>>>>> Kafka Streams has already
> >>>>>>>>>>>>> optimizer in place which alters topology independently from user
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree (with you) and think this is a good way to put it -- we
> >>> currently
> >>>>>>>>>>>> auto-repartition for the user so
> >>>>>>>>>>>> that they don't have to walk through their entire topology and
> >>> reason about
> >>>>>>>>>>>> when and where to place a
> >>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly force
> >>> this onto the
> >>>>>>>>>>>> user? How certain are we that
> >>>>>>>>>>>> users will always get this right? It's easy to imagine that
> >>> during
> >>>>>>>>>>>> development, you write your new app with
> >>>>>>>>>>>> correctly placed repartitions in order to use this new feature.
> >>> During the
> >>>>>>>>>>>> course of development you end up
> >>>>>>>>>>>> tweaking the topology, but don't remember to review or move the
> >>>>>>>>>>>> repartitioning since you're used to Streams
> >>>>>>>>>>>> doing this for you. If you use only single-partition topics for
> >>> testing,
> >>>>>>>>>>>> you might not even notice your app is
> >>>>>>>>>>>> spitting out incorrect results!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
> >>> introduce a new
> >>>>>>>>>>>> feature and say that to use it, you can't take
> >>>>>>>>>>>> advantage of this other feature anymore. Also, is it possible our
> >>>>>>>>>>>> optimization framework could ever include an
> >>>>>>>>>>>> optimized repartitioning strategy that is better than what a
> >>> user could
> >>>>>>>>>>>> achieve by manually inserting repartitions?
> >>>>>>>>>>>> Do we expect users to have a deep understanding of the best way
> >>> to
> >>>>>>>>>>>> repartition their particular topology, or is it
> >>>>>>>>>>>> likely they will end up over-repartitioning either due to missed
> >>>>>>>>>>>> optimizations or unnecessary extra repartitions?
> >>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
> >>> repartition
> >>>>>>>>>>>> required at this point in the topology, it should
> >>>>>>>>>>>> have N partitions"
> >>>>>>>>>>>>
> >>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather
> >>> than
> >>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in line
> >>> with the
> >>>>>>>>>>>> current syntax so +1 from me
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
> >>> levani.codes@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and
> >>> it’s
> >>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP to
> >>> address
> >>>>>>>>>>>>> couple of concerns raised by Matthias and John.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
> >>> control over
> >>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
> >>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is done
> >>> in
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
> >>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is
> >>> planned to
> >>>>>>>>>>>>> be separate PR.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> While all agree about general implementation and idea behind
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
> >>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
> >>> questions during
> >>>>>>>>>>>>> the review.
> >>>>>>>>>>>>> Matthias raised concern that there can be cases when user uses
> >>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
> >>> repartitioning may
> >>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
> >>> would never be
> >>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted your
> >>> comment).
> >>>>>>>>>>>>> So instead, if user wants to control parallelism of
> >>> sub-topologies, he or
> >>>>>>>>>>>>> she should always use `KStream#repartition` operation before
> >>> groupBy. Full
> >>>>>>>>>>>>> comment can be seen here:
> >>>>>>>>>>>>>
> >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> >>>>>>>>>>>>>
> >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On the same topic, John pointed out that, from API design
> >>> perspective, we
> >>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
> >>> operators between
> >>>>>>>>>>>>> one another. So instead of introducing new
> >>> `KStream#groupBy(Repartitioned)`
> >>>>>>>>>>>>> for specifying number of partitions for internal topic, we
> >>> should update
> >>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> >>> other hand
> >>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
> >>> topology
> >>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
> >>> Streams,
> >>>>>>>>>>>>> internally would optimize topology in the best way possible,
> >>> even if in
> >>>>>>>>>>>>> some cases this means ignoring some operator configurations
> >>> passed by the
> >>>>>>>>>>>>> user. Also, I agree with John about API design semantics. If we
> >>> go through
> >>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes more
> >>> sense to
> >>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of
> >>> introducing
> >>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I would really appreciate communities feedback on this.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
> >>> sophie@confluent.io>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
> >>> don't have
> >>>>>>>>>>>>> much
> >>>>>>>>>>>>>> spare time at the
> >>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
> >>> things, but
> >>>>>>>>>>>>> feel
> >>>>>>>>>>>>>> free to pick up something else
> >>>>>>>>>>>>>> to work on in the meantime until things have calmed down a bit!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like to get
> >>> some
> >>>>>>>>>>>>>>> attention on this PR:
> >>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
> >>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
> >>>>>>>>>>>>>>> It's been a while now and I would love to move on to other
> >>> KIPs as well.
> >>>>>>>>>>>>>>> Please let me know if you have any concerns.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
> >>>>>>>>>>>>>>>
> >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
> >>> <
> >>>>>>>>>>>>>>>
> >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
> >>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
> >>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on
> >>> that one.
> >>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good
> >>> idea.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Updated KIP
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
> >>> matthias@confluent.io
> >>>>>>>>>>>>> <ma...@confluent.io>
> >>>>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>>
> >>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> One question:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Why do we add
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
> >>> numberOfPartitions)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
> >>> `#numberOfPartitions(int)` in
> >>>>>>>>>>>>>>>>>> combination with `withName()` and
> >>> `withNumberOfPartitions()` should
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
> >>> small if
> >>>>>>>>>>>>>>> possible.)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Otherwise LGTM.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
> >>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any
> >>> other
> >>>>>>>>>>>>>>> concerns about this KIP, please let me know.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
> >>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ).
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> >>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
> >>>>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>
> >>> <mailto:
> >>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
> >>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
> >>> partitioning
> >>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
> >>> producer
> >>>>>>>>>>>>>>> configs. My
> >>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
> >>> "extension" to
> >>>>>>>>>>>>>>> `Produced`,
> >>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think
> >>> about it and
> >>>>>>>>>>>>>>> helps
> >>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
> >>>>>>>>>>>>> abbreviations.
> >>>>>>>>>>>>>>> Can
> >>>>>>>>>>>>>>>>>>>>> we rename:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static`
> >>> methods:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes
> >>> sense to
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>> some duplicate configurations between operators in order to
> >>> keep API
> >>>>>>>>>>>>> simple
> >>>>>>>>>>>>>>> and usable.
> >>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have
> >>> proper
> >>>>>>>>>>>>>>> backward compatibility.
> >>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
> >>> separate was
> >>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
> >>> producer level
> >>>>>>>>>>>>>>> configs makes API more readable.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
> >>> producer level
> >>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
> >>> considerations
> >>>>>>>>>>>>>>> separate.
> >>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
> >>> level
> >>>>>>>>>>>>>>> configuration. But
> >>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as
> >>> much a topic
> >>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two
> >>> producers
> >>>>>>>>>>>>>>> configured
> >>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
> >>> they are
> >>>>>>>>>>>>>>> writing to the
> >>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
> >>> part. So in a
> >>>>>>>>>>>>>>> sense, these
> >>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
> >>> producers.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is
> >>> not always
> >>>>>>>>>>>>>>> true to
> >>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
> >>> operator are
> >>>>>>>>>>>>>>> wrapped into a
> >>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could
> >>> instead
> >>>>>>>>>>>>>>> split out the
> >>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
> >>> parameter to keep
> >>>>>>>>>>>>>>> topic and
> >>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
> >>> increases the
> >>>>>>>>>>>>>>> API surface
> >>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just
> >>> say "this
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
> >>> without worrying
> >>>>>>>>>>>>>>> about what
> >>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
> >>> field of
> >>>>>>>>>>>>>>> Repartitioned,
> >>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
> >>> elsewhere in
> >>>>>>>>>>>>>>> Streams at
> >>>>>>>>>>>>>>>>>>>>>>> the moment
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>> <mailto:
> >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> >>>>>>>>>>>>>>> `Repartitioned#partitioner`
> >>>>>>>>>>>>>>>>>>>>>>>> configuration.
> >>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for
> >>> topic level
> >>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
> >>>>>>>>>>>>>>> configurations (such
> >>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this
> >>> thread).
> >>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
> >>> interface, I
> >>>>>>>>>>>>>>> couldn’t really
> >>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
> >>> configuration
> >>>>>>>>>>>>>>> class and just
> >>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking
> >>> the
> >>>>>>>>>>>>>>> semantics.
> >>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
> >>> encapsulates topic
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is
> >>> viable
> >>>>>>>>>>>>>>> thing to do.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
> >>> bbejeck@gmail.com
> >>>>>>>>>>>>> <ma...@gmail.com>
> >>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
> >>> <mailto:
> >>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In
> >>> addition to
> >>>>>>>>>>>>>>> the reason
> >>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
> >>> hash-based
> >>>>>>>>>>>>>>> partitioning
> >>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
> >>> different than the
> >>>>>>>>>>>>>>> key
> >>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
> >>> conjunction with
> >>>>>>>>>>>>>>> the key
> >>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
> >>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
> >>> Produced hence
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense
> >>> for API
> >>>>>>>>>>>>>>> consistency.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>> levani.codes@gmail.com>
> >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>> <mailto:
> >>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
> >>> Repartitioned class,
> >>>>>>>>>>>>>>> with the
> >>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for
> >>> internal
> >>>>>>>>>>>>>>> topics, user
> >>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning
> >>> before join
> >>>>>>>>>>>>>>> operation.
> >>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> >>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
> >>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
> >>> <mailto:
> >>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again,
> >>> not totally
> >>>>>>>>>>>>>>> sure it
> >>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is
> >>> the
> >>>>>>>>>>>>> rationale
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> >>> levani.codes@gmail.com>
> >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example
> >>> I would
> >>>>>>>>>>>>>>> first issue
> >>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner
> >>> that
> >>>>>>>>>>>>>>> essentially
> >>>>>>>>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and
> >>> then do the
> >>>>>>>>>>>>>>>>>>>>>>>> KStream#join
> >>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> >>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> >>> john@confluent.io
> >>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
> >>>>>>>>>>>>> john@confluent.io>
> >>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to
> >>> specify
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that
> >>> some data may
> >>>>>>>>>>>>>>> get
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
> >>> topic. If the
> >>>>>>>>>>>>>>> right-side
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
> >>> strategy, then the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
> >>> partitioned with the
> >>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
> >>> something
> >>>>>>>>>>>>>>> important?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
> >>> Kokhreidze
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> >>> levani.codes@gmail.com>
> >>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be
> >>> honest I’m
> >>>>>>>>>>>>> not
> >>>>>>>>>>>>>>> sure
> >>>>>>>>>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
> >>> think I would
> >>>>>>>>>>>>>>> need control
> >>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams
> >>> knows best
> >>>>>>>>>>>>>>> how to
> >>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be
> >>> used only for
> >>>>>>>>>>>>>>> topics
> >>>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
> >>> possibility to
> >>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I
> >>> guess
> >>>>>>>>>>>>>>> specifying the
> >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
> >>> agreement on
> >>>>>>>>>>>>> this.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
> >>> Blee-Goldman <
> >>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> >>> <mailto:
> >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
> >>> Repartitioned
> >>>>>>>>>>>>>>> would be a
> >>>>>>>>>>>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need
> >>> to have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar
> >>> to
> >>>>>>>>>>>>>>> Produced? I'm not
> >>>>>>>>>>>>>>>>>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems
> >>> it should
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
> >>> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>> levani.codes@gmail.com>
> >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka
> >>> Streams.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more
> >>> control over
> >>>>>>>>>>>>>>> the topic
> >>>>>>>>>>>>>>>>>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
> >>> something that
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>> missing
> >>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
> >>> parallelism by
> >>>>>>>>>>>>>>> configuring
> >>>>>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
> >>> Blee-Goldman <
> >>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> >>> <mailto:
> >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
> >>> thing for me
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> for the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
> >>> Repartitioned,
> >>>>>>>>>>>>>>> will the
> >>>>>>>>>>>>>>>>>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be
> >>> the case
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to
> >>> be
> >>>>>>>>>>>>>>> pre-created? The
> >>>>>>>>>>>>>>>>>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of
> >>> the method
> >>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
> >>> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>> levani.codes@gmail.com>
> >>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
> >>> opinions/concerns
> >>>>>>>>>>>>> about
> >>>>>>>>>>>>>>> this KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
> >>> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
> >>> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve
> >>> updated
> >>>>>>>>>>>>>>> confluence
> >>>>>>>>>>>>>>>>>>>>>>>> page
> >>>>>>>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives”
> >>> section.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
> >>> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
> >>> dedicated `Topic`
> >>>>>>>>>>>>>>> class for
> >>>>>>>>>>>>>>>>>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
> >>>>>>>>>>>>> `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion
> >>> about this
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>>>> well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias
> >>> J. Sax <
> >>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And
> >>> thanks for
> >>>>>>>>>>>>>>> summarizing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
> >>> discussed
> >>>>>>>>>>>>>>> already (can't
> >>>>>>>>>>>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good
> >>> summary to
> >>>>>>>>>>>>>>> refresh the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense.
> >>> With regard
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics
> >>> and
> >>>>>>>>>>>>> operators
> >>>>>>>>>>>>>>> that use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it
> >>> might
> >>>>>>>>>>>>> indicate
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that
> >>> uses
> >>>>>>>>>>>>>>> use-defined
> >>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
> >>> operator (an
> >>>>>>>>>>>>>>> operator that
> >>>>>>>>>>>>>>>>>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
> >>> Otherwise, there is
> >>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
> >>> topics but
> >>>>>>>>>>>>>>> sometimes
> >>>>>>>>>>>>>>>>>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the
> >>> distinction
> >>>>>>>>>>>>>>> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
> >>> `Grouped`. I am
> >>>>>>>>>>>>>>> wondering
> >>>>>>>>>>>>>>>>>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply
> >>> and adding
> >>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
> >>> opinions of
> >>>>>>>>>>>>>>> others on
> >>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer
> >>> atm.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects
> >>> that allow
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> configure
> >>>>>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer,
> >>> store. If we
> >>>>>>>>>>>>>>> assume that
> >>>>>>>>>>>>>>>>>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we
> >>> should have a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class
> >>> and method
> >>>>>>>>>>>>>>> instead of
> >>>>>>>>>>>>>>>>>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add
> >>> other
> >>>>>>>>>>>>> configs,
> >>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily,
> >>> without the
> >>>>>>>>>>>>>>> need to
> >>>>>>>>>>>>>>>>>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure
> >>> if I like
> >>>>>>>>>>>>>>> them
> >>>>>>>>>>>>>>>>>>>>>>>> myself.
> >>>>>>>>>>>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze
> >>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
> >>> maybe
> >>>>>>>>>>>>> enhancing
> >>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
> >>> best approach.
> >>>>>>>>>>>>>>> Let me
> >>>>>>>>>>>>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with
> >>> this
> >>>>>>>>>>>>>>> configuration,
> >>>>>>>>>>>>>>>>>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since
> >>> KStream#to is
> >>>>>>>>>>>>>>> the final
> >>>>>>>>>>>>>>>>>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
> >>>>>>>>>>>>> assumption
> >>>>>>>>>>>>>>> that user
> >>>>>>>>>>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And
> >>> in that
> >>>>>>>>>>>>>>> case, having
> >>>>>>>>>>>>>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much
> >>> sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based
> >>> on API
> >>>>>>>>>>>>>>> contract, seems
> >>>>>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that
> >>> is
> >>>>>>>>>>>>>>> explicitly for
> >>>>>>>>>>>>>>>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner
> >>> those all
> >>>>>>>>>>>>>>> are producer
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is
> >>> topic level
> >>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> >>>>>>>>>>>>>>> configurations
> >>>>>>>>>>>>>>>>>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
> >>> seems like
> >>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal
> >>> (e.g
> >>>>>>>>>>>>> topics
> >>>>>>>>>>>>>>> created
> >>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I
> >>> think we
> >>>>>>>>>>>>>>> should leave
> >>>>>>>>>>>>>>>>>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account,
> >>> I think we
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
> >>> Streams should
> >>>>>>>>>>>>>>> create and
> >>>>>>>>>>>>>>>>>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
> >>> topics that
> >>>>>>>>>>>>>>> should be
> >>>>>>>>>>>>>>>>>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
> >>> numPartitions
> >>>>>>>>>>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer
> >>> level
> >>>>>>>>>>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
> >>> semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional
> >>> in
> >>>>>>>>>>>>>>> KStream#through - I
> >>>>>>>>>>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving
> >>> users
> >>>>>>>>>>>>>>> possibility to
> >>>>>>>>>>>>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
> >>>>>>>>>>>>> Considering
> >>>>>>>>>>>>>>> arguments
> >>>>>>>>>>>>>>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
> >>> class, I see two
> >>>>>>>>>>>>>>> options
> >>>>>>>>>>>>>>>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> >>> auto-generated and
> >>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
> >>> - topic
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
> >>> final
> >>>>>>>>>>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
> >>> with
> >>>>>>>>>>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> >>> parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and
> >>> introduce
> >>>>>>>>>>>>>>> new method
> >>>>>>>>>>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
> >>> suggested this
> >>>>>>>>>>>>>>> in one of
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
> >>> propose the
> >>>>>>>>>>>>>>> following
> >>>>>>>>>>>>>>>>>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> >>> configuration to
> >>>>>>>>>>>>> Grouped
> >>>>>>>>>>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> >>>>>>>>>>>>>>> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> >>> auto-generated and
> >>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
> >>> - topic
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
> >>> final
> >>>>>>>>>>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
> >>> with
> >>>>>>>>>>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> >>> parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> >>> configuration to
> >>>>>>>>>>>>> Grouped
> >>>>>>>>>>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
> >>> KStream#repartition for
> >>>>>>>>>>>>>>> creating and
> >>>>>>>>>>>>>>>>>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
> >>> already
> >>>>>>>>>>>>>>> discussed in the
> >>>>>>>>>>>>>>>>>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads
> >>> that were
> >>>>>>>>>>>>>>> about this
> >>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani
> >>> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>>>>>>>>>>>> levani.codes@gmail.com>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion
> >>> around
> >>>>>>>>>>>>>>> KIP-221. Going
> >>>>>>>>>>>>>>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to
> >>> agreement
> >>>>>>>>>>>>>>> around
> >>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far
> >>> as I
> >>>>>>>>>>>>>>> understood, the
> >>>>>>>>>>>>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
> >>> following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
> >>> KStream#through
> >>>>>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
> >>> numOfPartitions
> >>>>>>>>>>>>>>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
> >>> users to
> >>>>>>>>>>>>> control
> >>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
> >>>>>>>>>>>>> operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
> >>> changes around
> >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
> >
>


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Levani Kokhreidze <le...@gmail.com>.
Hello,

Just fyi, PR was updated and now it incorporates the latest suggestions about joins. 
`CopartitionedTopicsEnforcer` will throw an exception if number of partitions aren’t the same when using `repartition` operation along with `join`.

For more details please take a look at the PR: https://github.com/apache/kafka/pull/7170/files <https://github.com/apache/kafka/pull/7170/files>

Regards,
Levani 


> On Nov 15, 2019, at 11:01 AM, Matthias J. Sax <ma...@confluent.io> wrote:
> 
> Thanks a lot for the input Sophie.
> 
> Your example is quite useful, and I would use it to support my claim
> that a "partition hint" for `Grouped` seems "useless" and does not
> improve the user experience.
> 
> 1) You argue that a new user would be worries about repartitions topics
> with too many paritions. This would imply that a user is already
> advanced enough to understand the implication of repartitioning -- for
> this case, I would argue that a user also understand _when_ a
> auto-repartitioning would happen and thus the users understands where to
> insert a `repartition()` operation.
> 
> 2) For specifying Serdes: if a `groupByKey()` does not trigger
> auto-repartitioning it's not required to specify the serdes and if they
> are specified they would be ignored/unused (note, that `groupBy()` would
> always trigger a repartitioning). Of course, if the default Serdes from
> the config match (eg, all data types are Json anyway), a user does not
> need to worry about specifying serdes. -- For new user that play around,
> I would assume that they work a lot with primitive types and thus would
> need to specify the serdes -- hence, they would learn about
> auto-repartitioning the hard way anyhow, because each time a
> `groupByKey()` does trigger auto-repartioning, they would need to pass
> in the correct Serdes -- this way, they would also be educated where to
> insert a `repartition()` operator if needed.
> 
> 3) If a new user really just "plays around", I don't think they use an
> input topic with 100 partitions but most likely have a local single node
> broker with most likely single partitions topics.
> 
> 
> My main argument for my current proposal is however, that---based on
> past experience---it's better to roll out a new feature more carefully
> and see how it goes. Last, as John pointed out, we can still extend the
> feature in the future. Instead of making a judgment call up-front, being
> more conservative and less fancy, and revisit the design based on
> actuall user feedback after the first version is rolled out, seems to be
> the better option. Undoing a feature is must harder than extending it.
> 
> 
> While I advocate strong for a simple first version of this feature, it's
> a community decission in the end, and I would not block this KIP if
> there is a broad preference to add `Grouped#withNumberOfPartitions()`
> either.
> 
> 
> -Matthias
> 
> On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
>> It seems like we all agree at this point (please correct me if wrong!) that
>> we should NOT change
>> the existing repartitioning behavior, ie we should allow Streams to
>> continue to determine when and
>> where to repartition -- *unless* explicitly informed to by the use of a
>> .through or the new .repartition operator.
>> 
>> Regarding groupBy, the existing behavior we should not disrupt is
>> a) repartition *only* when required due to upstream key-changing operation
>> (ie don't force repartitioning
>> based on the presence of an optional config parameter), and
>> b) allow optimization of required repartitions, if any
>> 
>> Within the constraint of not breaking the existing behavior, this still
>> leaves open the question of whether we
>> want to improve the user experience by allowing to provide groupBy with a
>> *suggestion* for numPartitions (or to
>> put it more fairly, whether that *will* improve the experience). I agree
>> with many of the arguments outlined above but
>> let me just push back on this one issue one final time, and if we can't
>> come to a consensus then I am happy to drop
>> it for now so that the KIP can proceed.
>> 
>> Specifically, my proposal would be to simply augment Grouped with an
>> optional numPartitions, understood to
>> indicate the user's desired number of partitions *if Streams decides to
>> repartition due to that groupBy*
>> 
>>> if a user cares about the number of partition, the user wants to enforce
>> a repartitioning
>> First, I think we should take a step back and examine this claim. I agree
>> 100% that *if this is true,*
>> *then we should not give groupBy an optional numPartitions.* As far as I
>> see it, there's no argument
>> to be had there if we *presuppose that claim.* But I'm not convinced in
>> that as an axiom of the user
>> experience and think we should be examining that claim itself, not the
>> consequences of it.
>> 
>> To give a simple example, let's say some new user is trying out Streams and
>> wants to just play around
>> with it to see if it might be worth looking into. They want to just write
>> up a simple app and test it out on the
>> data in some existing topics they have with a large number of partitions,
>> and a lot of data. They're just messing
>> around, trying new topologies and don't want to go through each new one
>> step by step to determine if (or where)
>> a repartition might be required. They also don't want to force a
>> repartition if it turns out to not be required, so they'd
>> like to avoid the nice new .repartition operator they saw. But given the
>> huge number of input partitions, they'd like
>> to rest assured that if a repartition does end up being required somewhere
>> during dev, it will not be created with
>> the same huge number of partitions that their input topic has -- so they
>> just pass groupBy a small numPartitions
>> suggestion.
>> 
>> I know that's a bit of a contrived example but I think it does highlight
>> how and when this might be a considerable
>> quality of life improvement, in particular for new users to Streams and/or
>> during the dev cycle. *You don't want to*
>> *force a repartition if it wasn't necessary, but you don't want to create a
>> topic with a huge partition count either.*
>> 
>> Also, while the optimization discussion took us down an interesting but
>> ultimately more distracting road, it's worth
>> pointing out that it is clearly a major win to have as few
>> repartition topics/steps as possible. Given that we
>> don't want to change existing behavior, the optimization framework can only
>> help out when the placement of
>> repartition steps is flexible, which means only those from .groupBy (and
>> not .repartition). *Users should not*
>> *have to choose between allowing Streams to optimize the repartition
>> placement, and allowing to specify a *
>> *number of partitions.*
>> 
>> Lastly, I have what may be a stupid question but for my own edification of
>> how groupBy works:
>> if you do a .groupBy and a repartition is NOT required, does it ever need
>> to serialize/deserialize
>> any of the data? In other words, if you pass a key/value serde to groupBy
>> and it doesn't trigger
>> a repartition, is the serde(s) just ignored and thus more like a suggestion
>> than a requirement?
>> 
>> So again, I don't want to hold up this KIP forever but I feel we've spent
>> some time getting slightly
>> off track (although certainly into very interesting discussions) yet never
>> really addressed or questioned
>> the basic premise: *could a user want to specify a number of partitions but
>> not enforce a repartition (at that*
>> *specific point in the topology)?*
>> 
>> 
>> 
>> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <ma...@confluent.io>
>> wrote:
>> 
>>> Side remark:
>>> 
>>> If the user specifies `repartition()` on both side of the join, we can
>>> actually throw the execption earlier, ie, when we build the topology.
>>> 
>>> Current, we can do this check only after Kafka Streams was started,
>>> within `StreamPartitionAssignor#assign()` -- we still need to keep this
>>> check for the case that none or only one side has a user specified
>>> number of partitions though.
>>> 
>>> 
>>> -Matthias
>>> 
>>> On 11/14/19 8:15 AM, John Roesler wrote:
>>>> Thanks, all,
>>>> 
>>>> I can get behind just totally leaving out reparation-via-groupBy. If
>>>> we only introduce `repartition()` for now, we're making the minimal
>>>> change to gain the desired capability.
>>>> 
>>>> Plus, since we agree that `repartition()` should never be optimizable,
>>>> it's a future-compatible proposal. I.e., if we were to add a
>>>> non-optimizable groupBy(partitions) operation now, and want to make it
>>>> optimizable in the future, we have to worry about topology
>>>> compatibility. Better to just do non-optimizable `repartition()` now,
>>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
>>>> 
>>>> About joins, yes, it's a concern, and IMO we should just do the same
>>>> thing we do now... check at runtime that the partition counts on both
>>>> sides match and throw an exception otherwise. What this means as a
>>>> user is that if you explicitly repartition the left side to 100
>>>> partitions, and then join with the right side at 10 partitions, you
>>>> get an exception, since this operation is not possible. You'd either
>>>> have to "step down" the left side again, back to 10 partitions, or you
>>>> could repartition the right side to 100 partitions before the join.
>>>> The choice has to be the user's, since it depends on their desired
>>>> execution parallelism.
>>>> 
>>>> Thanks,
>>>> -John
>>>> 
>>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>>> 
>>>>> Thanks a lot John. I think the way you decompose the operators is super
>>>>> helpful for this discussion.
>>>>> 
>>>>> What you suggest with regard to using `Grouped` and enforcing
>>>>> repartitioning if the number of partitions is specified is certainly
>>>>> possible. However, I am not sure if we _should_ do this. My reasoning is
>>>>> that an enforce repartitioning as introduced via `repartition()` is an
>>>>> expensive operations, and it seems better to demand an more explicit
>>>>> user opt-in to trigger it. Just setting an optional parameter might be
>>>>> too subtle to trigger such a heavy "side effect".
>>>>> 
>>>>> While I agree about "usability" in general, I would prefer a more
>>>>> conservative appraoch to introduce this feature, see how it goes, and
>>>>> maybe make it more advance later on. This also applies to what
>>>>> optimzation we may or may not allow (or are able to perform at all).
>>>>> 
>>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
>>>>> Grouped`, I agree that it might not be a good idea.
>>>>> 
>>>>> Atm, I see an enforces repartitioning as non-optimizable and as a good
>>>>> first step and I would suggest to not intoruce anything else for now.
>>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
>>>>> Grouped)` is something we could add later.
>>>>> 
>>>>> 
>>>>> Therefore, I would not change `Grouped` but only introduce
>>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in to
>>>>> set the number of partitions, would need to rewrite their code to
>>>>> `selectKey(...).repartition(...).groupByKey()`. It's less convinient but
>>>>> also less risky from an API and optimization point of view.
>>>>> 
>>>>> 
>>>>> @Levani: about joins -> yes, we will need to check the specified number
>>>>> of partitions (if any) and if they don't match, throw an exception. We
>>>>> can discuss this on the PR -- I am just trying to get the PR for KIP-466
>>>>> merged -- your is next on the list :)
>>>>> 
>>>>> 
>>>>> Thoughts?
>>>>> 
>>>>> 
>>>>> -Matthias
>>>>> 
>>>>> 
>>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>>>>> Thank you all for an interesting discussion. This is very enlightening.
>>>>>> 
>>>>>> Thank you Matthias for your explanation. Your arguments are very true.
>>> It makes sense that if user specifies number of partitions he/she really
>>> cares that those specifications are applied to internal topics.
>>>>>> Unfortunately, in current implementation this is not true during
>>> `join` operation. As I’ve written in the PR comment, currently, when
>>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of
>>> partitions from the two source topics.
>>>>>> I’m not really sure what would be the other way around this situation.
>>> Maybe fail the stream altogether and inform the user to specify same number
>>> of partitions?
>>>>>> Or we should treat join operations in a same way as it is right now
>>> and basically choose max number of partitions even when `repartition`
>>> operation is specified, because Kafka Streams “knows the best” how to
>>> handle joins?
>>>>>> You can check integration tests how it’s being handled currently. Open
>>> to suggestions on that part.
>>>>>> 
>>>>>> As for groupBy, I agree and John raised very interesting points. My
>>> arguments for allowing users to specify number of partitions during groupBy
>>> operations mainly was coming from the usability perspective.
>>>>>> So building on top of what John said, maybe it makes sense to make
>>> `groupBy` operations smarter and whenever user specifies
>>> `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
>>>>>> I’m not going into optimization part yet :) I think it will be part of
>>> separate PR and task, but overall it makes sense to apply optimizations
>>> where number of partitions are the same.
>>>>>> 
>>>>>> As for Repartitioned extending Grouped, I kinda feel that it won’t fit
>>> nicely in current API design.
>>>>>> In addition, in the PR review, John mentioned that there were a lot of
>>> troubles in the past trying to use one operation's configuration objects on
>>> other operations.
>>>>>> Also it makes sense to keep them separate in terms of compatibility.
>>>>>> In that case, we don’t have to worry every time Grouped is changed,
>>> what would be the implications on `repartition` operations.
>>>>>> 
>>>>>> Kind regards,
>>>>>> Levani
>>>>>> 
>>>>>> 
>>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
>>>>>>> 
>>>>>>> Ah, thanks for the clarification. I missed your point.
>>>>>>> 
>>>>>>> I like the framework you've presented. It does seem simpler to assume
>>>>>>> that they either care about the partition count and want to
>>>>>>> repartition to realize it, or they don't care about the number.
>>>>>>> Returning to this discussion, it does seem unlikely that they care
>>>>>>> about the number and _don't_ care if it actually gets realized.
>>>>>>> 
>>>>>>> But then, it still seems like we can just keep the option as part of
>>>>>>> Grouped. As in:
>>>>>>> 
>>>>>>> // user does not care
>>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>>>> 
>>>>>>> // user does care
>>>>>>> stream.repartition(Repartitioned)
>>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>>>> 
>>>>>>> ----
>>>>>>> 
>>>>>>> The above discussion got me thinking about algebra. Matthias is
>>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent to
>>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
>>>>>>> we should force people to apply that expansion themselves vs. having a
>>>>>>> more compact way to express it if they don't care where exactly the
>>>>>>> repartition occurs. However, thinking about these operators
>>>>>>> algebraically can really help *us* narrow down the number of different
>>>>>>> expressions we have to consider.
>>>>>>> 
>>>>>>> Let's consider some identities:
>>>>>>> 
>>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>>>>> D: repartition = sink(managed) + src
>>>>>>> 
>>>>>>> In these identities, I used one special identifier (...), which means
>>>>>>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
>>>>>>> repartition, or agg.
>>>>>>> 
>>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>>>>> operations are not executable. In other words, you have to get to a
>>>>>>> point where you can apply B to convert a groupByKey into a passthough
>>>>>>> in order to execute the program. This is just a formal way of stating
>>>>>>> what already happens in Kafka Streams.
>>>>>>> 
>>>>>>> By applying A, we can just completely leave `groupBy` out of our
>>>>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>>>>> groupByKey.
>>>>>>> 
>>>>>>> Then, we can eliminate the "repartition required" case of `groupByKey`
>>>>>>> by applying C followed by D to get to the "no repartition required"
>>>>>>> version of groupByKey, which in turn sets us up to apply B to get an
>>>>>>> executable topology.
>>>>>>> 
>>>>>>> Fundamentally, you can think about KIP-221 is as proposing a modified
>>>>>>> D identity in which you can specify the partition count of the managed
>>>>>>> sink topic:
>>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>>>> 
>>>>>>> Since users _could_ apply the identities above, we don't actually have
>>>>>>> to add any partition count to groupBy[Key], but we decided early on in
>>>>>>> the KIP discussion that it's more ergonomic to add it. In that case,
>>>>>>> we also have to modify A and C:
>>>>>>> A': groupBy(mapper, pc) + agg
>>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
>>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>>>> 
>>>>>>> Which sets us up still to always be able to get back to a plain
>>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>>>>> ultimately B to get an executable topology.
>>>>>>> 
>>>>>>> What about the optimizer?
>>>>>>> The optimizer applies another set of graph-algebraic identities to
>>>>>>> minimize the number of repartition topics in a topology.
>>>>>>> 
>>>>>>> (forgive my ascii art)
>>>>>>> 
>>>>>>> E: (merging repartition nodes)
>>>>>>> (...) -> repartition -> X
>>>>>>> \-> repartition -> Y
>>>>>>> =
>>>>>>> (... + repartition) -> X
>>>>>>>    \-> Y
>>>>>>> F: (reordering around repartition)
>>>>>>> Where SVO is any non-key-changing, stateless, operation:
>>>>>>> repartition -> SVO = SVO -> repartition
>>>>>>> 
>>>>>>> In terms of these identities, what the optimizer does is apply F
>>>>>>> repeatedly in either direction to a topology to factor out common in
>>>>>>> branches so that it can apply E to merge repartition nodes. This was
>>>>>>> especially necessary before KIP-221 because you couldn't directly
>>>>>>> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
>>>>>>> so there was no way to do the factoring manually.
>>>>>>> 
>>>>>>> We can now state very clearly that in KIP-221, explicit
>>>>>>> `repartition()` operators should create a "reordering barrier". So, F
>>>>>>> cannot be applied to an explicit `repartition()`. Also, I think we
>>>>>>> decided earlier that explicit `repartition()` operations would also be
>>>>>>> ineligible for merging, so E can't be applied to explicit
>>>>>>> `repartition()` operations either. I think we feel we _could_ apply E
>>>>>>> without harm, but we want to be conservative for now.
>>>>>>> 
>>>>>>> I think the salient point from the latter discussion has been that
>>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
>>>>>>> an explicit `repartition()` operator, and therefore, the resulting
>>>>>>> repartition node remains eligible for optimization.
>>>>>>> 
>>>>>>> To be clear, I agree with Matthias that the provided partition count
>>>>>>> _must_ be used in the resulting implicit repartition. This has some
>>>>>>> implications for E. Namely, E could only be applied to two repartition
>>>>>>> nodes that have the same partition count. This has always been
>>>>>>> trivially true before KIP-221 because the partition count has always
>>>>>>> been "unspecified", i.e., it would be determined at runtime by the
>>>>>>> user-managed-topics' partition counts. Now, it could be specified or
>>>>>>> unspecified. We can simply augment E to allow merging only repartition
>>>>>>> nodes where the partition count is EITHER "specified and the same on
>>>>>>> both sides", OR "unspecified on both sides".
>>>>>>> 
>>>>>>> Sorry for the long email, but I have a hope that it builds a solid
>>>>>>> theoretical foundation for our decisions in KIP-221, so we can have
>>>>>>> confidence that there are no edge cases for design flaws to hide.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> -John
>>>>>>> 
>>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
>>> matthias@confluent.io> wrote:
>>>>>>>> 
>>>>>>>>> it seems like we do want to allow
>>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>>> operation, but we don't want that option to _force_ repartitioning
>>>>>>>> 
>>>>>>>> Correct, ie, that is my suggestions.
>>>>>>>> 
>>>>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>>>> 
>>>>>>>> I disagree here, because my reasoning is that:
>>>>>>>> 
>>>>>>>> - if a user cares about the number of partition, the user wants to
>>>>>>>> enforce a repartitioning
>>>>>>>> - if a user does not case about the number of partitions, we don't
>>> need
>>>>>>>> to provide them a way to pass in a "hint"
>>>>>>>> 
>>>>>>>> Hence, it should be sufficient to support:
>>>>>>>> 
>>>>>>>> // user does not care
>>>>>>>> 
>>>>>>>> `stream.groupByKey(Grouped)`
>>>>>>>> `stream.grouBy(..., Grouped)`
>>>>>>>> 
>>>>>>>> // user does care
>>>>>>>> 
>>>>>>>> `stream.repartition(Repartitioned).groupByKey()`
>>>>>>>> `streams.groupBy(..., Repartitioned)`
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Matthias
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>>>>> Thanks for those thoughts, Matthias,
>>>>>>>>> 
>>>>>>>>> I find your reasoning about the optimization behavior compelling.
>>> The
>>>>>>>>> `through` operation is very simple and clear to reason about. It
>>> just
>>>>>>>>> passes the data exactly at the specified point in the topology
>>> exactly
>>>>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>>>>> `repartition` operator, the simplest behavior is if we just do what
>>>>>>>>> they asked for.
>>>>>>>>> 
>>>>>>>>> Stepping back to think about when optimizations are surprising and
>>>>>>>>> when they aren't, it occurs to me that we should be free to move
>>>>>>>>> around repartitions when users have asked to perform some operation
>>>>>>>>> that implies a repartition, like "change keys, then filter, then
>>>>>>>>> aggregate". This program requires a repartition, but it could be
>>>>>>>>> anywhere between the key change and the aggregation. On the other
>>>>>>>>> hand, if they say, "change keys, then filter, then repartition, then
>>>>>>>>> aggregate", it seems like they were pretty clear about their desire,
>>>>>>>>> and we should just take it at face value.
>>>>>>>>> 
>>>>>>>>> So, I'm sold on just literally doing a repartition every time they
>>>>>>>>> invoke the `repartition` operator.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is more
>>> nuanced.
>>>>>>>>> 
>>>>>>>>> What you said about `groupByKey` makes sense. If they key hasn't
>>>>>>>>> actually changed, then we don't need to repartition before
>>>>>>>>> aggregating. On the other hand, `groupBy` is specifically changing
>>> the
>>>>>>>>> key as part of the grouping operation, so (as you said) we
>>> definitely
>>>>>>>>> have to do a repartition.
>>>>>>>>> 
>>>>>>>>> If I'm reading the discussion right, it seems like we do want to
>>> allow
>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>> operation, but we don't want that option to _force_ repartitioning
>>> if
>>>>>>>>> it's not needed. That last clause is the key. "Use P partitions if
>>>>>>>>> repartitioning is necessary" is a directive that applies cleanly and
>>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call the
>>>>>>>>> option `numberOfPartitionsHint`, which along with the "if necessary"
>>>>>>>>> javadoc, should make it clear that the option won't force a
>>>>>>>>> repartition, and also gives us enough latitude to still employ the
>>>>>>>>> optimizer on those repartition topics?
>>>>>>>>> 
>>>>>>>>> If we like the idea of expressing it as a "hint" for grouping and a
>>>>>>>>> "command" for `repartition`, then it seems like it still makes sense
>>>>>>>>> to keep Grouped and Repartitioned separate, as they would actually
>>>>>>>>> offer different methods with distinct semantics.
>>>>>>>>> 
>>>>>>>>> WDYT?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> -John
>>>>>>>>> 
>>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
>>> matthias@confluent.io> wrote:
>>>>>>>>>> 
>>>>>>>>>> Sorry for late reply.
>>>>>>>>>> 
>>>>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>>>> 
>>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using the
>>>>>>>>>> upstream topic to determine the number of partitions)
>>>>>>>>>> - KS does auto-repartitioning only for downstream DSL operators
>>> like
>>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
>>> auto-repartitioning
>>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>>>>> - KS offers `through()` to enforce a repartitioning -- however,
>>> the user
>>>>>>>>>> needs to create the topic manually (with the desired number of
>>> partitions).
>>>>>>>>>> 
>>>>>>>>>> I see two main applications for `repartitioning()`:
>>>>>>>>>> 
>>>>>>>>>> 1) repartition data before a `transform()` but user does not want
>>> to
>>>>>>>>>> manage the topic
>>>>>>>>>> 2) scale out a downstream subtopology
>>>>>>>>>> 
>>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a users
>>> calls
>>>>>>>>>> it, a repartitining is enforced, with the difference that KS
>>> manages the
>>>>>>>>>> topic and the user does not need to create it.
>>>>>>>>>> 
>>>>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>>>> 
>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>> repartition
>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>> have N partitions"
>>>>>>>>>> 
>>>>>>>>>> Because of (2), I disagree. Either a user does not care about
>>> scaling
>>>>>>>>>> out, for which case she would not specify the number of
>>> partitions. Or a
>>>>>>>>>> user does care, and hence wants to enforce the scale out. I don't
>>> think
>>>>>>>>>> that any user would say, "maybe scale out".
>>>>>>>>>> 
>>>>>>>>>> Therefore, the optimizer should never ignore the repartition
>>> operation.
>>>>>>>>>> As a "consequence" (because repartitioning is expensive) a user
>>> should
>>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
>>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>>>>>>>>>> because it might be too subtle and an "optional scaling out" as
>>> laid out
>>>>>>>>>> above does not make sense IMHO.
>>>>>>>>>> 
>>>>>>>>>> I am also not worried about "over repartitioning" because the
>>> result
>>>>>>>>>> stream would never trigger auto-repartitioning. Only if multiple
>>>>>>>>>> consecutive calls to `repartition()` are made it could be bad --
>>> but
>>>>>>>>>> that's the same with `through()`. In the end, there is always some
>>>>>>>>>> responsibility on the user.
>>>>>>>>>> 
>>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
>>>>>>>>>> however, for `groupByKey()` it depends if the KStream is marked as
>>>>>>>>>> `repartitioningRequired`.
>>>>>>>>>> 
>>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a user to
>>> set
>>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different story,
>>>>>>>>>> because calling
>>>>>>>>>> 
>>>>>>>>>>  `repartition().groupBy()`
>>>>>>>>>> 
>>>>>>>>>> does not achieve what we want. Hence, allowing users to pass in the
>>>>>>>>>> number of users partitions into `groupBy()` does actually makes
>>> sense,
>>>>>>>>>> because repartitioning will happen anyway and thus we can
>>> piggyback a
>>>>>>>>>> scaling decision.
>>>>>>>>>> 
>>>>>>>>>> I think that John has a fair concern about the overloads, however,
>>> I am
>>>>>>>>>> not convinced that using `Grouped` to specify the number of
>>> partitions
>>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and
>>> both
>>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>>>>>>>>>> wondering if we could bridge the gap between both, if we would make
>>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what seems to
>>> make
>>>>>>>>>> the API quite smooth:
>>>>>>>>>> 
>>>>>>>>>> `stream.groupBy(..., Grouped...)`
>>>>>>>>>> 
>>>>>>>>>> `stream.groupBy(..., Repartitioned...)`
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thoughts?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -Matthias
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>>>>> Hi Sophie,
>>>>>>>>>>> 
>>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
>>> hearing others opinion as well on this.
>>>>>>>>>>> 
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
>>> sophie@confluent.io> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>> other hand
>>>>>>>>>>>> Kafka Streams has already
>>>>>>>>>>>>> optimizer in place which alters topology independently from user
>>>>>>>>>>>> 
>>>>>>>>>>>> I agree (with you) and think this is a good way to put it -- we
>>> currently
>>>>>>>>>>>> auto-repartition for the user so
>>>>>>>>>>>> that they don't have to walk through their entire topology and
>>> reason about
>>>>>>>>>>>> when and where to place a
>>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly force
>>> this onto the
>>>>>>>>>>>> user? How certain are we that
>>>>>>>>>>>> users will always get this right? It's easy to imagine that
>>> during
>>>>>>>>>>>> development, you write your new app with
>>>>>>>>>>>> correctly placed repartitions in order to use this new feature.
>>> During the
>>>>>>>>>>>> course of development you end up
>>>>>>>>>>>> tweaking the topology, but don't remember to review or move the
>>>>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>>>>> doing this for you. If you use only single-partition topics for
>>> testing,
>>>>>>>>>>>> you might not even notice your app is
>>>>>>>>>>>> spitting out incorrect results!
>>>>>>>>>>>> 
>>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
>>> introduce a new
>>>>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>>>>> advantage of this other feature anymore. Also, is it possible our
>>>>>>>>>>>> optimization framework could ever include an
>>>>>>>>>>>> optimized repartitioning strategy that is better than what a
>>> user could
>>>>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>>>>> Do we expect users to have a deep understanding of the best way
>>> to
>>>>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>>>>> likely they will end up over-repartitioning either due to missed
>>>>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>>> repartition
>>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>>> have N partitions"
>>>>>>>>>>>> 
>>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather
>>> than
>>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in line
>>> with the
>>>>>>>>>>>> current syntax so +1 from me
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
>>> levani.codes@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and
>>> it’s
>>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP to
>>> address
>>>>>>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
>>> control over
>>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is done
>>> in
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is
>>> planned to
>>>>>>>>>>>>> be separate PR.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While all agree about general implementation and idea behind
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
>>> questions during
>>>>>>>>>>>>> the review.
>>>>>>>>>>>>> Matthias raised concern that there can be cases when user uses
>>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
>>> repartitioning may
>>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
>>> would never be
>>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted your
>>> comment).
>>>>>>>>>>>>> So instead, if user wants to control parallelism of
>>> sub-topologies, he or
>>>>>>>>>>>>> she should always use `KStream#repartition` operation before
>>> groupBy. Full
>>>>>>>>>>>>> comment can be seen here:
>>>>>>>>>>>>> 
>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>>>>>>> 
>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On the same topic, John pointed out that, from API design
>>> perspective, we
>>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
>>> operators between
>>>>>>>>>>>>> one another. So instead of introducing new
>>> `KStream#groupBy(Repartitioned)`
>>>>>>>>>>>>> for specifying number of partitions for internal topic, we
>>> should update
>>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>>> other hand
>>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
>>> topology
>>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
>>> Streams,
>>>>>>>>>>>>> internally would optimize topology in the best way possible,
>>> even if in
>>>>>>>>>>>>> some cases this means ignoring some operator configurations
>>> passed by the
>>>>>>>>>>>>> user. Also, I agree with John about API design semantics. If we
>>> go through
>>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes more
>>> sense to
>>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of
>>> introducing
>>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
>>> sophie@confluent.io>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
>>> don't have
>>>>>>>>>>>>> much
>>>>>>>>>>>>>> spare time at the
>>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
>>> things, but
>>>>>>>>>>>>> feel
>>>>>>>>>>>>>> free to pick up something else
>>>>>>>>>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like to get
>>> some
>>>>>>>>>>>>>>> attention on this PR:
>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>>>>>>> It's been a while now and I would love to move on to other
>>> KIPs as well.
>>>>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>>>>> 
>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
>>> <
>>>>>>>>>>>>>>> 
>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on
>>> that one.
>>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good
>>> idea.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
>>> matthias@confluent.io
>>>>>>>>>>>>> <ma...@confluent.io>
>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>>
>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
>>> numberOfPartitions)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
>>> `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>>>>> combination with `withName()` and
>>> `withNumberOfPartitions()` should
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
>>> small if
>>>>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any
>>> other
>>>>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>
>>> <mailto:
>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
>>> partitioning
>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
>>> producer
>>>>>>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
>>> "extension" to
>>>>>>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think
>>> about it and
>>>>>>>>>>>>>>> helps
>>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>>>>>>>>>> abbreviations.
>>>>>>>>>>>>>>> Can
>>>>>>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static`
>>> methods:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes
>>> sense to
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>> some duplicate configurations between operators in order to
>>> keep API
>>>>>>>>>>>>> simple
>>>>>>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have
>>> proper
>>>>>>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
>>> separate was
>>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
>>> producer level
>>>>>>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
>>> producer level
>>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
>>> considerations
>>>>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
>>> level
>>>>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as
>>> much a topic
>>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two
>>> producers
>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
>>> they are
>>>>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
>>> part. So in a
>>>>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
>>> producers.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is
>>> not always
>>>>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
>>> operator are
>>>>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could
>>> instead
>>>>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
>>> parameter to keep
>>>>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
>>> increases the
>>>>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just
>>> say "this
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
>>> without worrying
>>>>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
>>> field of
>>>>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
>>> elsewhere in
>>>>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>> <mailto:
>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for
>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this
>>> thread).
>>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
>>> interface, I
>>>>>>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
>>> configuration
>>>>>>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking
>>> the
>>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
>>> encapsulates topic
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is
>>> viable
>>>>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
>>> bbejeck@gmail.com
>>>>>>>>>>>>> <ma...@gmail.com>
>>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
>>> <mailto:
>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In
>>> addition to
>>>>>>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
>>> hash-based
>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
>>> different than the
>>>>>>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
>>> conjunction with
>>>>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
>>> Produced hence
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense
>>> for API
>>>>>>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>> <mailto:
>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
>>> Repartitioned class,
>>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for
>>> internal
>>>>>>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning
>>> before join
>>>>>>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
>>> <mailto:
>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again,
>>> not totally
>>>>>>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is
>>> the
>>>>>>>>>>>>> rationale
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example
>>> I would
>>>>>>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner
>>> that
>>>>>>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and
>>> then do the
>>>>>>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>> john@confluent.io
>>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>>>>>>>>>>>> john@confluent.io>
>>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to
>>> specify
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that
>>> some data may
>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
>>> topic. If the
>>>>>>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
>>> strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
>>> partitioned with the
>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
>>> something
>>>>>>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
>>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be
>>> honest I’m
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
>>> think I would
>>>>>>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams
>>> knows best
>>>>>>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be
>>> used only for
>>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
>>> possibility to
>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I
>>> guess
>>>>>>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
>>> agreement on
>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>> <mailto:
>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
>>> Repartitioned
>>>>>>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need
>>> to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar
>>> to
>>>>>>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems
>>> it should
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka
>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more
>>> control over
>>>>>>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
>>> something that
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
>>> parallelism by
>>>>>>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
>>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>>> <mailto:
>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
>>> thing for me
>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
>>> Repartitioned,
>>>>>>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be
>>> the case
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to
>>> be
>>>>>>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of
>>> the method
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>
>>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>> <
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
>>> opinions/concerns
>>>>>>>>>>>>> about
>>>>>>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve
>>> updated
>>>>>>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives”
>>> section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
>>> dedicated `Topic`
>>>>>>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>>>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion
>>> about this
>>>>>>>>>>>>> as
>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias
>>> J. Sax <
>>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And
>>> thanks for
>>>>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
>>> discussed
>>>>>>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good
>>> summary to
>>>>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense.
>>> With regard
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics
>>> and
>>>>>>>>>>>>> operators
>>>>>>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it
>>> might
>>>>>>>>>>>>> indicate
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that
>>> uses
>>>>>>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
>>> operator (an
>>>>>>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
>>> Otherwise, there is
>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
>>> topics but
>>>>>>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the
>>> distinction
>>>>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
>>> `Grouped`. I am
>>>>>>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply
>>> and adding
>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
>>> opinions of
>>>>>>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer
>>> atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects
>>> that allow
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer,
>>> store. If we
>>>>>>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we
>>> should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class
>>> and method
>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add
>>> other
>>>>>>>>>>>>> configs,
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily,
>>> without the
>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure
>>> if I like
>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze
>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
>>> maybe
>>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
>>> best approach.
>>>>>>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with
>>> this
>>>>>>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since
>>> KStream#to is
>>>>>>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>>>>>>>>>> assumption
>>>>>>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And
>>> in that
>>>>>>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much
>>> sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based
>>> on API
>>>>>>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that
>>> is
>>>>>>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner
>>> those all
>>>>>>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is
>>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
>>> seems like
>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal
>>> (e.g
>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I
>>> think we
>>>>>>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account,
>>> I think we
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
>>> Streams should
>>>>>>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
>>> topics that
>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
>>> numPartitions
>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer
>>> level
>>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional
>>> in
>>>>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving
>>> users
>>>>>>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>>>>>>>>>> Considering
>>>>>>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
>>> class, I see two
>>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>> auto-generated and
>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
>>> - topic
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
>>> final
>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
>>> with
>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and
>>> introduce
>>>>>>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
>>> suggested this
>>>>>>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
>>> propose the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>> configuration to
>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>>> auto-generated and
>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
>>> - topic
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
>>> final
>>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
>>> with
>>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>>> configuration to
>>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
>>> KStream#repartition for
>>>>>>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
>>> already
>>>>>>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads
>>> that were
>>>>>>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani
>>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion
>>> around
>>>>>>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to
>>> agreement
>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far
>>> as I
>>>>>>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
>>> KStream#through
>>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
>>> numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
>>> users to
>>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
>>> changes around
>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks a lot for the input Sophie.

Your example is quite useful, and I would use it to support my claim
that a "partition hint" for `Grouped` seems "useless" and does not
improve the user experience.

1) You argue that a new user would be worries about repartitions topics
with too many paritions. This would imply that a user is already
advanced enough to understand the implication of repartitioning -- for
this case, I would argue that a user also understand _when_ a
auto-repartitioning would happen and thus the users understands where to
insert a `repartition()` operation.

2) For specifying Serdes: if a `groupByKey()` does not trigger
auto-repartitioning it's not required to specify the serdes and if they
are specified they would be ignored/unused (note, that `groupBy()` would
always trigger a repartitioning). Of course, if the default Serdes from
the config match (eg, all data types are Json anyway), a user does not
need to worry about specifying serdes. -- For new user that play around,
I would assume that they work a lot with primitive types and thus would
need to specify the serdes -- hence, they would learn about
auto-repartitioning the hard way anyhow, because each time a
`groupByKey()` does trigger auto-repartioning, they would need to pass
in the correct Serdes -- this way, they would also be educated where to
insert a `repartition()` operator if needed.

3) If a new user really just "plays around", I don't think they use an
input topic with 100 partitions but most likely have a local single node
broker with most likely single partitions topics.


My main argument for my current proposal is however, that---based on
past experience---it's better to roll out a new feature more carefully
and see how it goes. Last, as John pointed out, we can still extend the
feature in the future. Instead of making a judgment call up-front, being
more conservative and less fancy, and revisit the design based on
actuall user feedback after the first version is rolled out, seems to be
the better option. Undoing a feature is must harder than extending it.


While I advocate strong for a simple first version of this feature, it's
a community decission in the end, and I would not block this KIP if
there is a broad preference to add `Grouped#withNumberOfPartitions()`
either.


-Matthias

On 11/14/19 11:35 PM, Sophie Blee-Goldman wrote:
> It seems like we all agree at this point (please correct me if wrong!) that
> we should NOT change
> the existing repartitioning behavior, ie we should allow Streams to
> continue to determine when and
> where to repartition -- *unless* explicitly informed to by the use of a
> .through or the new .repartition operator.
> 
> Regarding groupBy, the existing behavior we should not disrupt is
> a) repartition *only* when required due to upstream key-changing operation
> (ie don't force repartitioning
> based on the presence of an optional config parameter), and
> b) allow optimization of required repartitions, if any
> 
> Within the constraint of not breaking the existing behavior, this still
> leaves open the question of whether we
> want to improve the user experience by allowing to provide groupBy with a
> *suggestion* for numPartitions (or to
> put it more fairly, whether that *will* improve the experience). I agree
> with many of the arguments outlined above but
> let me just push back on this one issue one final time, and if we can't
> come to a consensus then I am happy to drop
> it for now so that the KIP can proceed.
> 
> Specifically, my proposal would be to simply augment Grouped with an
> optional numPartitions, understood to
> indicate the user's desired number of partitions *if Streams decides to
> repartition due to that groupBy*
> 
>> if a user cares about the number of partition, the user wants to enforce
> a repartitioning
> First, I think we should take a step back and examine this claim. I agree
> 100% that *if this is true,*
> *then we should not give groupBy an optional numPartitions.* As far as I
> see it, there's no argument
> to be had there if we *presuppose that claim.* But I'm not convinced in
> that as an axiom of the user
> experience and think we should be examining that claim itself, not the
> consequences of it.
> 
> To give a simple example, let's say some new user is trying out Streams and
> wants to just play around
> with it to see if it might be worth looking into. They want to just write
> up a simple app and test it out on the
> data in some existing topics they have with a large number of partitions,
> and a lot of data. They're just messing
> around, trying new topologies and don't want to go through each new one
> step by step to determine if (or where)
> a repartition might be required. They also don't want to force a
> repartition if it turns out to not be required, so they'd
> like to avoid the nice new .repartition operator they saw. But given the
> huge number of input partitions, they'd like
> to rest assured that if a repartition does end up being required somewhere
> during dev, it will not be created with
> the same huge number of partitions that their input topic has -- so they
> just pass groupBy a small numPartitions
> suggestion.
> 
> I know that's a bit of a contrived example but I think it does highlight
> how and when this might be a considerable
> quality of life improvement, in particular for new users to Streams and/or
> during the dev cycle. *You don't want to*
> *force a repartition if it wasn't necessary, but you don't want to create a
> topic with a huge partition count either.*
> 
> Also, while the optimization discussion took us down an interesting but
> ultimately more distracting road, it's worth
> pointing out that it is clearly a major win to have as few
> repartition topics/steps as possible. Given that we
> don't want to change existing behavior, the optimization framework can only
> help out when the placement of
> repartition steps is flexible, which means only those from .groupBy (and
> not .repartition). *Users should not*
> *have to choose between allowing Streams to optimize the repartition
> placement, and allowing to specify a *
> *number of partitions.*
> 
> Lastly, I have what may be a stupid question but for my own edification of
> how groupBy works:
> if you do a .groupBy and a repartition is NOT required, does it ever need
> to serialize/deserialize
> any of the data? In other words, if you pass a key/value serde to groupBy
> and it doesn't trigger
> a repartition, is the serde(s) just ignored and thus more like a suggestion
> than a requirement?
> 
> So again, I don't want to hold up this KIP forever but I feel we've spent
> some time getting slightly
> off track (although certainly into very interesting discussions) yet never
> really addressed or questioned
> the basic premise: *could a user want to specify a number of partitions but
> not enforce a repartition (at that*
> *specific point in the topology)?*
> 
> 
> 
> On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> Side remark:
>>
>> If the user specifies `repartition()` on both side of the join, we can
>> actually throw the execption earlier, ie, when we build the topology.
>>
>> Current, we can do this check only after Kafka Streams was started,
>> within `StreamPartitionAssignor#assign()` -- we still need to keep this
>> check for the case that none or only one side has a user specified
>> number of partitions though.
>>
>>
>> -Matthias
>>
>> On 11/14/19 8:15 AM, John Roesler wrote:
>>> Thanks, all,
>>>
>>> I can get behind just totally leaving out reparation-via-groupBy. If
>>> we only introduce `repartition()` for now, we're making the minimal
>>> change to gain the desired capability.
>>>
>>> Plus, since we agree that `repartition()` should never be optimizable,
>>> it's a future-compatible proposal. I.e., if we were to add a
>>> non-optimizable groupBy(partitions) operation now, and want to make it
>>> optimizable in the future, we have to worry about topology
>>> compatibility. Better to just do non-optimizable `repartition()` now,
>>> and add an optimizable `groupBy(partitions)` in the future (maybe).
>>>
>>> About joins, yes, it's a concern, and IMO we should just do the same
>>> thing we do now... check at runtime that the partition counts on both
>>> sides match and throw an exception otherwise. What this means as a
>>> user is that if you explicitly repartition the left side to 100
>>> partitions, and then join with the right side at 10 partitions, you
>>> get an exception, since this operation is not possible. You'd either
>>> have to "step down" the left side again, back to 10 partitions, or you
>>> could repartition the right side to 100 partitions before the join.
>>> The choice has to be the user's, since it depends on their desired
>>> execution parallelism.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <ma...@confluent.io>
>> wrote:
>>>>
>>>> Thanks a lot John. I think the way you decompose the operators is super
>>>> helpful for this discussion.
>>>>
>>>> What you suggest with regard to using `Grouped` and enforcing
>>>> repartitioning if the number of partitions is specified is certainly
>>>> possible. However, I am not sure if we _should_ do this. My reasoning is
>>>> that an enforce repartitioning as introduced via `repartition()` is an
>>>> expensive operations, and it seems better to demand an more explicit
>>>> user opt-in to trigger it. Just setting an optional parameter might be
>>>> too subtle to trigger such a heavy "side effect".
>>>>
>>>> While I agree about "usability" in general, I would prefer a more
>>>> conservative appraoch to introduce this feature, see how it goes, and
>>>> maybe make it more advance later on. This also applies to what
>>>> optimzation we may or may not allow (or are able to perform at all).
>>>>
>>>> @Levani: Reflecting about my suggestion about `Repartioned extends
>>>> Grouped`, I agree that it might not be a good idea.
>>>>
>>>> Atm, I see an enforces repartitioning as non-optimizable and as a good
>>>> first step and I would suggest to not intoruce anything else for now.
>>>> Introducing optimizable enforce repartitioning via `groupBy(...,
>>>> Grouped)` is something we could add later.
>>>>
>>>>
>>>> Therefore, I would not change `Grouped` but only introduce
>>>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in to
>>>> set the number of partitions, would need to rewrite their code to
>>>> `selectKey(...).repartition(...).groupByKey()`. It's less convinient but
>>>> also less risky from an API and optimization point of view.
>>>>
>>>>
>>>> @Levani: about joins -> yes, we will need to check the specified number
>>>> of partitions (if any) and if they don't match, throw an exception. We
>>>> can discuss this on the PR -- I am just trying to get the PR for KIP-466
>>>> merged -- your is next on the list :)
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>>>> Thank you all for an interesting discussion. This is very enlightening.
>>>>>
>>>>> Thank you Matthias for your explanation. Your arguments are very true.
>> It makes sense that if user specifies number of partitions he/she really
>> cares that those specifications are applied to internal topics.
>>>>> Unfortunately, in current implementation this is not true during
>> `join` operation. As I’ve written in the PR comment, currently, when
>> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of
>> partitions from the two source topics.
>>>>> I’m not really sure what would be the other way around this situation.
>> Maybe fail the stream altogether and inform the user to specify same number
>> of partitions?
>>>>> Or we should treat join operations in a same way as it is right now
>> and basically choose max number of partitions even when `repartition`
>> operation is specified, because Kafka Streams “knows the best” how to
>> handle joins?
>>>>> You can check integration tests how it’s being handled currently. Open
>> to suggestions on that part.
>>>>>
>>>>> As for groupBy, I agree and John raised very interesting points. My
>> arguments for allowing users to specify number of partitions during groupBy
>> operations mainly was coming from the usability perspective.
>>>>> So building on top of what John said, maybe it makes sense to make
>> `groupBy` operations smarter and whenever user specifies
>> `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
>>>>> I’m not going into optimization part yet :) I think it will be part of
>> separate PR and task, but overall it makes sense to apply optimizations
>> where number of partitions are the same.
>>>>>
>>>>> As for Repartitioned extending Grouped, I kinda feel that it won’t fit
>> nicely in current API design.
>>>>> In addition, in the PR review, John mentioned that there were a lot of
>> troubles in the past trying to use one operation's configuration objects on
>> other operations.
>>>>> Also it makes sense to keep them separate in terms of compatibility.
>>>>> In that case, we don’t have to worry every time Grouped is changed,
>> what would be the implications on `repartition` operations.
>>>>>
>>>>> Kind regards,
>>>>> Levani
>>>>>
>>>>>
>>>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
>>>>>>
>>>>>> Ah, thanks for the clarification. I missed your point.
>>>>>>
>>>>>> I like the framework you've presented. It does seem simpler to assume
>>>>>> that they either care about the partition count and want to
>>>>>> repartition to realize it, or they don't care about the number.
>>>>>> Returning to this discussion, it does seem unlikely that they care
>>>>>> about the number and _don't_ care if it actually gets realized.
>>>>>>
>>>>>> But then, it still seems like we can just keep the option as part of
>>>>>> Grouped. As in:
>>>>>>
>>>>>> // user does not care
>>>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>>>
>>>>>> // user does care
>>>>>> stream.repartition(Repartitioned)
>>>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> The above discussion got me thinking about algebra. Matthias is
>>>>>> absolutely right that `groupByKey(numPartitions)` is equivalent to
>>>>>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
>>>>>> we should force people to apply that expansion themselves vs. having a
>>>>>> more compact way to express it if they don't care where exactly the
>>>>>> repartition occurs. However, thinking about these operators
>>>>>> algebraically can really help *us* narrow down the number of different
>>>>>> expressions we have to consider.
>>>>>>
>>>>>> Let's consider some identities:
>>>>>>
>>>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>>>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>>>> D: repartition = sink(managed) + src
>>>>>>
>>>>>> In these identities, I used one special identifier (...), which means
>>>>>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
>>>>>> repartition, or agg.
>>>>>>
>>>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>>>> operations are not executable. In other words, you have to get to a
>>>>>> point where you can apply B to convert a groupByKey into a passthough
>>>>>> in order to execute the program. This is just a formal way of stating
>>>>>> what already happens in Kafka Streams.
>>>>>>
>>>>>> By applying A, we can just completely leave `groupBy` out of our
>>>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>>>> groupByKey.
>>>>>>
>>>>>> Then, we can eliminate the "repartition required" case of `groupByKey`
>>>>>> by applying C followed by D to get to the "no repartition required"
>>>>>> version of groupByKey, which in turn sets us up to apply B to get an
>>>>>> executable topology.
>>>>>>
>>>>>> Fundamentally, you can think about KIP-221 is as proposing a modified
>>>>>> D identity in which you can specify the partition count of the managed
>>>>>> sink topic:
>>>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>>>
>>>>>> Since users _could_ apply the identities above, we don't actually have
>>>>>> to add any partition count to groupBy[Key], but we decided early on in
>>>>>> the KIP discussion that it's more ergonomic to add it. In that case,
>>>>>> we also have to modify A and C:
>>>>>> A': groupBy(mapper, pc) + agg
>>>>>> = mapKey(mapper) + groupByKey(pc) + agg
>>>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>>>
>>>>>> Which sets us up still to always be able to get back to a plain
>>>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>>>> ultimately B to get an executable topology.
>>>>>>
>>>>>> What about the optimizer?
>>>>>> The optimizer applies another set of graph-algebraic identities to
>>>>>> minimize the number of repartition topics in a topology.
>>>>>>
>>>>>> (forgive my ascii art)
>>>>>>
>>>>>> E: (merging repartition nodes)
>>>>>> (...) -> repartition -> X
>>>>>>  \-> repartition -> Y
>>>>>> =
>>>>>> (... + repartition) -> X
>>>>>>     \-> Y
>>>>>> F: (reordering around repartition)
>>>>>> Where SVO is any non-key-changing, stateless, operation:
>>>>>> repartition -> SVO = SVO -> repartition
>>>>>>
>>>>>> In terms of these identities, what the optimizer does is apply F
>>>>>> repeatedly in either direction to a topology to factor out common in
>>>>>> branches so that it can apply E to merge repartition nodes. This was
>>>>>> especially necessary before KIP-221 because you couldn't directly
>>>>>> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
>>>>>> so there was no way to do the factoring manually.
>>>>>>
>>>>>> We can now state very clearly that in KIP-221, explicit
>>>>>> `repartition()` operators should create a "reordering barrier". So, F
>>>>>> cannot be applied to an explicit `repartition()`. Also, I think we
>>>>>> decided earlier that explicit `repartition()` operations would also be
>>>>>> ineligible for merging, so E can't be applied to explicit
>>>>>> `repartition()` operations either. I think we feel we _could_ apply E
>>>>>> without harm, but we want to be conservative for now.
>>>>>>
>>>>>> I think the salient point from the latter discussion has been that
>>>>>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
>>>>>> an explicit `repartition()` operator, and therefore, the resulting
>>>>>> repartition node remains eligible for optimization.
>>>>>>
>>>>>> To be clear, I agree with Matthias that the provided partition count
>>>>>> _must_ be used in the resulting implicit repartition. This has some
>>>>>> implications for E. Namely, E could only be applied to two repartition
>>>>>> nodes that have the same partition count. This has always been
>>>>>> trivially true before KIP-221 because the partition count has always
>>>>>> been "unspecified", i.e., it would be determined at runtime by the
>>>>>> user-managed-topics' partition counts. Now, it could be specified or
>>>>>> unspecified. We can simply augment E to allow merging only repartition
>>>>>> nodes where the partition count is EITHER "specified and the same on
>>>>>> both sides", OR "unspecified on both sides".
>>>>>>
>>>>>> Sorry for the long email, but I have a hope that it builds a solid
>>>>>> theoretical foundation for our decisions in KIP-221, so we can have
>>>>>> confidence that there are no edge cases for design flaws to hide.
>>>>>>
>>>>>> Thanks,
>>>>>> -John
>>>>>>
>>>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
>> matthias@confluent.io> wrote:
>>>>>>>
>>>>>>>> it seems like we do want to allow
>>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>>> operation, but we don't want that option to _force_ repartitioning
>>>>>>>
>>>>>>> Correct, ie, that is my suggestions.
>>>>>>>
>>>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>>>
>>>>>>> I disagree here, because my reasoning is that:
>>>>>>>
>>>>>>> - if a user cares about the number of partition, the user wants to
>>>>>>> enforce a repartitioning
>>>>>>> - if a user does not case about the number of partitions, we don't
>> need
>>>>>>> to provide them a way to pass in a "hint"
>>>>>>>
>>>>>>> Hence, it should be sufficient to support:
>>>>>>>
>>>>>>> // user does not care
>>>>>>>
>>>>>>>  `stream.groupByKey(Grouped)`
>>>>>>>  `stream.grouBy(..., Grouped)`
>>>>>>>
>>>>>>> // user does care
>>>>>>>
>>>>>>>  `stream.repartition(Repartitioned).groupByKey()`
>>>>>>>  `streams.groupBy(..., Repartitioned)`
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>>
>>>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>>>> Thanks for those thoughts, Matthias,
>>>>>>>>
>>>>>>>> I find your reasoning about the optimization behavior compelling.
>> The
>>>>>>>> `through` operation is very simple and clear to reason about. It
>> just
>>>>>>>> passes the data exactly at the specified point in the topology
>> exactly
>>>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>>>> `repartition` operator, the simplest behavior is if we just do what
>>>>>>>> they asked for.
>>>>>>>>
>>>>>>>> Stepping back to think about when optimizations are surprising and
>>>>>>>> when they aren't, it occurs to me that we should be free to move
>>>>>>>> around repartitions when users have asked to perform some operation
>>>>>>>> that implies a repartition, like "change keys, then filter, then
>>>>>>>> aggregate". This program requires a repartition, but it could be
>>>>>>>> anywhere between the key change and the aggregation. On the other
>>>>>>>> hand, if they say, "change keys, then filter, then repartition, then
>>>>>>>> aggregate", it seems like they were pretty clear about their desire,
>>>>>>>> and we should just take it at face value.
>>>>>>>>
>>>>>>>> So, I'm sold on just literally doing a repartition every time they
>>>>>>>> invoke the `repartition` operator.
>>>>>>>>
>>>>>>>>
>>>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is more
>> nuanced.
>>>>>>>>
>>>>>>>> What you said about `groupByKey` makes sense. If they key hasn't
>>>>>>>> actually changed, then we don't need to repartition before
>>>>>>>> aggregating. On the other hand, `groupBy` is specifically changing
>> the
>>>>>>>> key as part of the grouping operation, so (as you said) we
>> definitely
>>>>>>>> have to do a repartition.
>>>>>>>>
>>>>>>>> If I'm reading the discussion right, it seems like we do want to
>> allow
>>>>>>>> people to optionally specify a partition count as part of this
>>>>>>>> operation, but we don't want that option to _force_ repartitioning
>> if
>>>>>>>> it's not needed. That last clause is the key. "Use P partitions if
>>>>>>>> repartitioning is necessary" is a directive that applies cleanly and
>>>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call the
>>>>>>>> option `numberOfPartitionsHint`, which along with the "if necessary"
>>>>>>>> javadoc, should make it clear that the option won't force a
>>>>>>>> repartition, and also gives us enough latitude to still employ the
>>>>>>>> optimizer on those repartition topics?
>>>>>>>>
>>>>>>>> If we like the idea of expressing it as a "hint" for grouping and a
>>>>>>>> "command" for `repartition`, then it seems like it still makes sense
>>>>>>>> to keep Grouped and Repartitioned separate, as they would actually
>>>>>>>> offer different methods with distinct semantics.
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
>> matthias@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>> Sorry for late reply.
>>>>>>>>>
>>>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>>>
>>>>>>>>> - KS does auto-repartitioning for correctness reasons (using the
>>>>>>>>> upstream topic to determine the number of partitions)
>>>>>>>>> - KS does auto-repartitioning only for downstream DSL operators
>> like
>>>>>>>>> `count()` (eg, a `transform()` does never trigger an
>> auto-repartitioning
>>>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>>>> - KS offers `through()` to enforce a repartitioning -- however,
>> the user
>>>>>>>>> needs to create the topic manually (with the desired number of
>> partitions).
>>>>>>>>>
>>>>>>>>> I see two main applications for `repartitioning()`:
>>>>>>>>>
>>>>>>>>> 1) repartition data before a `transform()` but user does not want
>> to
>>>>>>>>> manage the topic
>>>>>>>>> 2) scale out a downstream subtopology
>>>>>>>>>
>>>>>>>>> Hence, I see `repartition()` similar to `through()`: if a users
>> calls
>>>>>>>>> it, a repartitining is enforced, with the difference that KS
>> manages the
>>>>>>>>> topic and the user does not need to create it.
>>>>>>>>>
>>>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>>>
>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>> repartition
>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>> have N partitions"
>>>>>>>>>
>>>>>>>>> Because of (2), I disagree. Either a user does not care about
>> scaling
>>>>>>>>> out, for which case she would not specify the number of
>> partitions. Or a
>>>>>>>>> user does care, and hence wants to enforce the scale out. I don't
>> think
>>>>>>>>> that any user would say, "maybe scale out".
>>>>>>>>>
>>>>>>>>> Therefore, the optimizer should never ignore the repartition
>> operation.
>>>>>>>>> As a "consequence" (because repartitioning is expensive) a user
>> should
>>>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
>>>>>>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>>>>>>>>> because it might be too subtle and an "optional scaling out" as
>> laid out
>>>>>>>>> above does not make sense IMHO.
>>>>>>>>>
>>>>>>>>> I am also not worried about "over repartitioning" because the
>> result
>>>>>>>>> stream would never trigger auto-repartitioning. Only if multiple
>>>>>>>>> consecutive calls to `repartition()` are made it could be bad --
>> but
>>>>>>>>> that's the same with `through()`. In the end, there is always some
>>>>>>>>> responsibility on the user.
>>>>>>>>>
>>>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
>>>>>>>>> however, for `groupByKey()` it depends if the KStream is marked as
>>>>>>>>> `repartitioningRequired`.
>>>>>>>>>
>>>>>>>>> Hence, for `groupByKey()` it should not be possible for a user to
>> set
>>>>>>>>> number of partitions IMHO. For `groupBy()` it's a different story,
>>>>>>>>> because calling
>>>>>>>>>
>>>>>>>>>   `repartition().groupBy()`
>>>>>>>>>
>>>>>>>>> does not achieve what we want. Hence, allowing users to pass in the
>>>>>>>>> number of users partitions into `groupBy()` does actually makes
>> sense,
>>>>>>>>> because repartitioning will happen anyway and thus we can
>> piggyback a
>>>>>>>>> scaling decision.
>>>>>>>>>
>>>>>>>>> I think that John has a fair concern about the overloads, however,
>> I am
>>>>>>>>> not convinced that using `Grouped` to specify the number of
>> partitions
>>>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and
>> both
>>>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>>>>>>>>> wondering if we could bridge the gap between both, if we would make
>>>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>>>> `groupBy(Grouped)` and a user can pass in both types what seems to
>> make
>>>>>>>>> the API quite smooth:
>>>>>>>>>
>>>>>>>>>  `stream.groupBy(..., Grouped...)`
>>>>>>>>>
>>>>>>>>>  `stream.groupBy(..., Repartitioned...)`
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>>>> Hi Sophie,
>>>>>>>>>>
>>>>>>>>>> Thank you for your reply, very insightful. Looking forward
>> hearing others opinion as well on this.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
>> sophie@confluent.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>> other hand
>>>>>>>>>>> Kafka Streams has already
>>>>>>>>>>>> optimizer in place which alters topology independently from user
>>>>>>>>>>>
>>>>>>>>>>> I agree (with you) and think this is a good way to put it -- we
>> currently
>>>>>>>>>>> auto-repartition for the user so
>>>>>>>>>>> that they don't have to walk through their entire topology and
>> reason about
>>>>>>>>>>> when and where to place a
>>>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly force
>> this onto the
>>>>>>>>>>> user? How certain are we that
>>>>>>>>>>> users will always get this right? It's easy to imagine that
>> during
>>>>>>>>>>> development, you write your new app with
>>>>>>>>>>> correctly placed repartitions in order to use this new feature.
>> During the
>>>>>>>>>>> course of development you end up
>>>>>>>>>>> tweaking the topology, but don't remember to review or move the
>>>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>>>> doing this for you. If you use only single-partition topics for
>> testing,
>>>>>>>>>>> you might not even notice your app is
>>>>>>>>>>> spitting out incorrect results!
>>>>>>>>>>>
>>>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
>> introduce a new
>>>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>>>> advantage of this other feature anymore. Also, is it possible our
>>>>>>>>>>> optimization framework could ever include an
>>>>>>>>>>> optimized repartitioning strategy that is better than what a
>> user could
>>>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>>>> Do we expect users to have a deep understanding of the best way
>> to
>>>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>>>> likely they will end up over-repartitioning either due to missed
>>>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>>>> I think many users would prefer to just say "if there *is* a
>> repartition
>>>>>>>>>>> required at this point in the topology, it should
>>>>>>>>>>> have N partitions"
>>>>>>>>>>>
>>>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather
>> than
>>>>>>>>>>> adding a new parameter to groupBy, that does seem more in line
>> with the
>>>>>>>>>>> current syntax so +1 from me
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
>> levani.codes@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>
>>>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and
>> it’s
>>>>>>>>>>>> almost done, I want to resurrect discussion about this KIP to
>> address
>>>>>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>>>>>
>>>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
>> control over
>>>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>>>>>> 1) Introducing new KStream#repartition operation which is done
>> in
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is
>> planned to
>>>>>>>>>>>> be separate PR.
>>>>>>>>>>>>
>>>>>>>>>>>> While all agree about general implementation and idea behind
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
>> questions during
>>>>>>>>>>>> the review.
>>>>>>>>>>>> Matthias raised concern that there can be cases when user uses
>>>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
>> repartitioning may
>>>>>>>>>>>> not required, thus configuration passed via `Repartitioned`
>> would never be
>>>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted your
>> comment).
>>>>>>>>>>>> So instead, if user wants to control parallelism of
>> sub-topologies, he or
>>>>>>>>>>>> she should always use `KStream#repartition` operation before
>> groupBy. Full
>>>>>>>>>>>> comment can be seen here:
>>>>>>>>>>>>
>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>>>>>>
>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>>>>>
>>>>>>>>>>>> On the same topic, John pointed out that, from API design
>> perspective, we
>>>>>>>>>>>> shouldn’t intertwine configuration classes of different
>> operators between
>>>>>>>>>>>> one another. So instead of introducing new
>> `KStream#groupBy(Repartitioned)`
>>>>>>>>>>>> for specifying number of partitions for internal topic, we
>> should update
>>>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>>>>>
>>>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
>> other hand
>>>>>>>>>>>> Kafka Streams has already optimizer in place which alters
>> topology
>>>>>>>>>>>> independently from user. So maybe it makes sense if Kafka
>> Streams,
>>>>>>>>>>>> internally would optimize topology in the best way possible,
>> even if in
>>>>>>>>>>>> some cases this means ignoring some operator configurations
>> passed by the
>>>>>>>>>>>> user. Also, I agree with John about API design semantics. If we
>> go through
>>>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes more
>> sense to
>>>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of
>> introducing
>>>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>>>>>
>>>>>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
>> sophie@confluent.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
>> don't have
>>>>>>>>>>>> much
>>>>>>>>>>>>> spare time at the
>>>>>>>>>>>>> moment. It's kind of a difficult time to get attention on
>> things, but
>>>>>>>>>>>> feel
>>>>>>>>>>>>> free to pick up something else
>>>>>>>>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for bringing this thread again, but I would like to get
>> some
>>>>>>>>>>>>>> attention on this PR:
>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>>>>>> It's been a while now and I would love to move on to other
>> KIPs as well.
>>>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>>>>
>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
>> <
>>>>>>>>>>>>>>
>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on
>> that one.
>>>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good
>> idea.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
>> matthias@confluent.io
>>>>>>>>>>>> <ma...@confluent.io>
>>>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>>
>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
>> numberOfPartitions)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It seems that `#with(String name)`,
>> `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>>>> combination with `withName()` and
>> `withNumberOfPartitions()` should
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
>> small if
>>>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any
>> other
>>>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>
>> <mailto:
>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
>> partitioning
>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
>> producer
>>>>>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
>> "extension" to
>>>>>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think
>> about it and
>>>>>>>>>>>>>> helps
>>>>>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>>>>>>>>> abbreviations.
>>>>>>>>>>>>>> Can
>>>>>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static`
>> methods:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes
>> sense to
>>>>>>>>>>>> have
>>>>>>>>>>>>>> some duplicate configurations between operators in order to
>> keep API
>>>>>>>>>>>> simple
>>>>>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have
>> proper
>>>>>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
>> separate was
>>>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
>> producer level
>>>>>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
>> producer level
>>>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
>> considerations
>>>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
>> level
>>>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as
>> much a topic
>>>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two
>> producers
>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
>> they are
>>>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
>> part. So in a
>>>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
>> producers.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is
>> not always
>>>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
>> operator are
>>>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could
>> instead
>>>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
>> parameter to keep
>>>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
>> increases the
>>>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just
>> say "this
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
>> without worrying
>>>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
>> field of
>>>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
>> elsewhere in
>>>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>> <mailto:
>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for
>> topic level
>>>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this
>> thread).
>>>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
>> interface, I
>>>>>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
>> configuration
>>>>>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking
>> the
>>>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
>> encapsulates topic
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is
>> viable
>>>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
>> bbejeck@gmail.com
>>>>>>>>>>>> <ma...@gmail.com>
>>>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
>> <mailto:
>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In
>> addition to
>>>>>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
>> hash-based
>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
>> different than the
>>>>>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
>> conjunction with
>>>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
>> Produced hence
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense
>> for API
>>>>>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>> levani.codes@gmail.com>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>> <mailto:
>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
>> Repartitioned class,
>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for
>> internal
>>>>>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning
>> before join
>>>>>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
>> <mailto:
>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again,
>> not totally
>>>>>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is
>> the
>>>>>>>>>>>> rationale
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>> levani.codes@gmail.com>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example
>> I would
>>>>>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner
>> that
>>>>>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and
>> then do the
>>>>>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>> john@confluent.io
>>>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>>>>>>>>>>> john@confluent.io>
>>>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to
>> specify
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that
>> some data may
>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
>> topic. If the
>>>>>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
>> strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
>> partitioned with the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
>> something
>>>>>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
>> Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
>> levani.codes@gmail.com>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be
>> honest I’m
>>>>>>>>>>>> not
>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
>> think I would
>>>>>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams
>> knows best
>>>>>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be
>> used only for
>>>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
>> possibility to
>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I
>> guess
>>>>>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
>> agreement on
>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
>> Repartitioned
>>>>>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need
>> to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar
>> to
>>>>>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems
>> it should
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>> levani.codes@gmail.com>
>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka
>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more
>> control over
>>>>>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
>> something that
>>>>>>>>>>>> is
>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
>> parallelism by
>>>>>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
>> Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
>> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
>> thing for me
>>>>>>>>>>>> --
>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
>> Repartitioned,
>>>>>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be
>> the case
>>>>>>>>>>>> for
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to
>> be
>>>>>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of
>> the method
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>> levani.codes@gmail.com>
>>>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>> <
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
>> opinions/concerns
>>>>>>>>>>>> about
>>>>>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve
>> updated
>>>>>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives”
>> section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
>> dedicated `Topic`
>>>>>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion
>> about this
>>>>>>>>>>>> as
>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias
>> J. Sax <
>>>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And
>> thanks for
>>>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
>> discussed
>>>>>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good
>> summary to
>>>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense.
>> With regard
>>>>>>>>>>>> to
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics
>> and
>>>>>>>>>>>> operators
>>>>>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it
>> might
>>>>>>>>>>>> indicate
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that
>> uses
>>>>>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
>> operator (an
>>>>>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
>> Otherwise, there is
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
>> topics but
>>>>>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the
>> distinction
>>>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
>> `Grouped`. I am
>>>>>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply
>> and adding
>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
>> opinions of
>>>>>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer
>> atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects
>> that allow
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer,
>> store. If we
>>>>>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we
>> should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class
>> and method
>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add
>> other
>>>>>>>>>>>> configs,
>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily,
>> without the
>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure
>> if I like
>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze
>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
>> maybe
>>>>>>>>>>>> enhancing
>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
>> best approach.
>>>>>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with
>> this
>>>>>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since
>> KStream#to is
>>>>>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>>>>>>>>> assumption
>>>>>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And
>> in that
>>>>>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much
>> sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based
>> on API
>>>>>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that
>> is
>>>>>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner
>> those all
>>>>>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is
>> topic level
>>>>>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
>> seems like
>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal
>> (e.g
>>>>>>>>>>>> topics
>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I
>> think we
>>>>>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account,
>> I think we
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
>> Streams should
>>>>>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
>> topics that
>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
>> numPartitions
>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer
>> level
>>>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
>> semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional
>> in
>>>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving
>> users
>>>>>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>>>>>>>>> Considering
>>>>>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
>> class, I see two
>>>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>> auto-generated and
>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
>> - topic
>>>>>>>>>>>> will
>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
>> final
>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
>> with
>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and
>> introduce
>>>>>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
>> suggested this
>>>>>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
>> propose the
>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>> configuration to
>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
>> auto-generated and
>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
>> - topic
>>>>>>>>>>>> will
>>>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
>> final
>>>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
>> with
>>>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
>> parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
>> configuration to
>>>>>>>>>>>> Grouped
>>>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
>> KStream#repartition for
>>>>>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
>> already
>>>>>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads
>> that were
>>>>>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani
>> Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion
>> around
>>>>>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to
>> agreement
>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far
>> as I
>>>>>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
>> KStream#through
>>>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
>> numOfPartitions
>>>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
>> users to
>>>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
>> changes around
>>>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>
>>
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Sophie Blee-Goldman <so...@confluent.io>.
It seems like we all agree at this point (please correct me if wrong!) that
we should NOT change
the existing repartitioning behavior, ie we should allow Streams to
continue to determine when and
where to repartition -- *unless* explicitly informed to by the use of a
.through or the new .repartition operator.

Regarding groupBy, the existing behavior we should not disrupt is
a) repartition *only* when required due to upstream key-changing operation
(ie don't force repartitioning
based on the presence of an optional config parameter), and
b) allow optimization of required repartitions, if any

Within the constraint of not breaking the existing behavior, this still
leaves open the question of whether we
want to improve the user experience by allowing to provide groupBy with a
*suggestion* for numPartitions (or to
put it more fairly, whether that *will* improve the experience). I agree
with many of the arguments outlined above but
let me just push back on this one issue one final time, and if we can't
come to a consensus then I am happy to drop
it for now so that the KIP can proceed.

Specifically, my proposal would be to simply augment Grouped with an
optional numPartitions, understood to
indicate the user's desired number of partitions *if Streams decides to
repartition due to that groupBy*

> if a user cares about the number of partition, the user wants to enforce
a repartitioning
First, I think we should take a step back and examine this claim. I agree
100% that *if this is true,*
*then we should not give groupBy an optional numPartitions.* As far as I
see it, there's no argument
to be had there if we *presuppose that claim.* But I'm not convinced in
that as an axiom of the user
experience and think we should be examining that claim itself, not the
consequences of it.

To give a simple example, let's say some new user is trying out Streams and
wants to just play around
with it to see if it might be worth looking into. They want to just write
up a simple app and test it out on the
data in some existing topics they have with a large number of partitions,
and a lot of data. They're just messing
around, trying new topologies and don't want to go through each new one
step by step to determine if (or where)
a repartition might be required. They also don't want to force a
repartition if it turns out to not be required, so they'd
like to avoid the nice new .repartition operator they saw. But given the
huge number of input partitions, they'd like
to rest assured that if a repartition does end up being required somewhere
during dev, it will not be created with
the same huge number of partitions that their input topic has -- so they
just pass groupBy a small numPartitions
suggestion.

I know that's a bit of a contrived example but I think it does highlight
how and when this might be a considerable
quality of life improvement, in particular for new users to Streams and/or
during the dev cycle. *You don't want to*
*force a repartition if it wasn't necessary, but you don't want to create a
topic with a huge partition count either.*

Also, while the optimization discussion took us down an interesting but
ultimately more distracting road, it's worth
pointing out that it is clearly a major win to have as few
repartition topics/steps as possible. Given that we
don't want to change existing behavior, the optimization framework can only
help out when the placement of
repartition steps is flexible, which means only those from .groupBy (and
not .repartition). *Users should not*
*have to choose between allowing Streams to optimize the repartition
placement, and allowing to specify a *
*number of partitions.*

Lastly, I have what may be a stupid question but for my own edification of
how groupBy works:
if you do a .groupBy and a repartition is NOT required, does it ever need
to serialize/deserialize
any of the data? In other words, if you pass a key/value serde to groupBy
and it doesn't trigger
a repartition, is the serde(s) just ignored and thus more like a suggestion
than a requirement?

So again, I don't want to hold up this KIP forever but I feel we've spent
some time getting slightly
off track (although certainly into very interesting discussions) yet never
really addressed or questioned
the basic premise: *could a user want to specify a number of partitions but
not enforce a repartition (at that*
*specific point in the topology)?*



On Fri, Nov 15, 2019 at 12:18 AM Matthias J. Sax <ma...@confluent.io>
wrote:

> Side remark:
>
> If the user specifies `repartition()` on both side of the join, we can
> actually throw the execption earlier, ie, when we build the topology.
>
> Current, we can do this check only after Kafka Streams was started,
> within `StreamPartitionAssignor#assign()` -- we still need to keep this
> check for the case that none or only one side has a user specified
> number of partitions though.
>
>
> -Matthias
>
> On 11/14/19 8:15 AM, John Roesler wrote:
> > Thanks, all,
> >
> > I can get behind just totally leaving out reparation-via-groupBy. If
> > we only introduce `repartition()` for now, we're making the minimal
> > change to gain the desired capability.
> >
> > Plus, since we agree that `repartition()` should never be optimizable,
> > it's a future-compatible proposal. I.e., if we were to add a
> > non-optimizable groupBy(partitions) operation now, and want to make it
> > optimizable in the future, we have to worry about topology
> > compatibility. Better to just do non-optimizable `repartition()` now,
> > and add an optimizable `groupBy(partitions)` in the future (maybe).
> >
> > About joins, yes, it's a concern, and IMO we should just do the same
> > thing we do now... check at runtime that the partition counts on both
> > sides match and throw an exception otherwise. What this means as a
> > user is that if you explicitly repartition the left side to 100
> > partitions, and then join with the right side at 10 partitions, you
> > get an exception, since this operation is not possible. You'd either
> > have to "step down" the left side again, back to 10 partitions, or you
> > could repartition the right side to 100 partitions before the join.
> > The choice has to be the user's, since it depends on their desired
> > execution parallelism.
> >
> > Thanks,
> > -John
> >
> > On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <ma...@confluent.io>
> wrote:
> >>
> >> Thanks a lot John. I think the way you decompose the operators is super
> >> helpful for this discussion.
> >>
> >> What you suggest with regard to using `Grouped` and enforcing
> >> repartitioning if the number of partitions is specified is certainly
> >> possible. However, I am not sure if we _should_ do this. My reasoning is
> >> that an enforce repartitioning as introduced via `repartition()` is an
> >> expensive operations, and it seems better to demand an more explicit
> >> user opt-in to trigger it. Just setting an optional parameter might be
> >> too subtle to trigger such a heavy "side effect".
> >>
> >> While I agree about "usability" in general, I would prefer a more
> >> conservative appraoch to introduce this feature, see how it goes, and
> >> maybe make it more advance later on. This also applies to what
> >> optimzation we may or may not allow (or are able to perform at all).
> >>
> >> @Levani: Reflecting about my suggestion about `Repartioned extends
> >> Grouped`, I agree that it might not be a good idea.
> >>
> >> Atm, I see an enforces repartitioning as non-optimizable and as a good
> >> first step and I would suggest to not intoruce anything else for now.
> >> Introducing optimizable enforce repartitioning via `groupBy(...,
> >> Grouped)` is something we could add later.
> >>
> >>
> >> Therefore, I would not change `Grouped` but only introduce
> >> `repartition()`. Users that use `grouBy()` atm, and want to opt-in to
> >> set the number of partitions, would need to rewrite their code to
> >> `selectKey(...).repartition(...).groupByKey()`. It's less convinient but
> >> also less risky from an API and optimization point of view.
> >>
> >>
> >> @Levani: about joins -> yes, we will need to check the specified number
> >> of partitions (if any) and if they don't match, throw an exception. We
> >> can discuss this on the PR -- I am just trying to get the PR for KIP-466
> >> merged -- your is next on the list :)
> >>
> >>
> >> Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
> >>> Thank you all for an interesting discussion. This is very enlightening.
> >>>
> >>> Thank you Matthias for your explanation. Your arguments are very true.
> It makes sense that if user specifies number of partitions he/she really
> cares that those specifications are applied to internal topics.
> >>> Unfortunately, in current implementation this is not true during
> `join` operation. As I’ve written in the PR comment, currently, when
> `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of
> partitions from the two source topics.
> >>> I’m not really sure what would be the other way around this situation.
> Maybe fail the stream altogether and inform the user to specify same number
> of partitions?
> >>> Or we should treat join operations in a same way as it is right now
> and basically choose max number of partitions even when `repartition`
> operation is specified, because Kafka Streams “knows the best” how to
> handle joins?
> >>> You can check integration tests how it’s being handled currently. Open
> to suggestions on that part.
> >>>
> >>> As for groupBy, I agree and John raised very interesting points. My
> arguments for allowing users to specify number of partitions during groupBy
> operations mainly was coming from the usability perspective.
> >>> So building on top of what John said, maybe it makes sense to make
> `groupBy` operations smarter and whenever user specifies
> `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
> >>> I’m not going into optimization part yet :) I think it will be part of
> separate PR and task, but overall it makes sense to apply optimizations
> where number of partitions are the same.
> >>>
> >>> As for Repartitioned extending Grouped, I kinda feel that it won’t fit
> nicely in current API design.
> >>> In addition, in the PR review, John mentioned that there were a lot of
> troubles in the past trying to use one operation's configuration objects on
> other operations.
> >>> Also it makes sense to keep them separate in terms of compatibility.
> >>> In that case, we don’t have to worry every time Grouped is changed,
> what would be the implications on `repartition` operations.
> >>>
> >>> Kind regards,
> >>> Levani
> >>>
> >>>
> >>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
> >>>>
> >>>> Ah, thanks for the clarification. I missed your point.
> >>>>
> >>>> I like the framework you've presented. It does seem simpler to assume
> >>>> that they either care about the partition count and want to
> >>>> repartition to realize it, or they don't care about the number.
> >>>> Returning to this discussion, it does seem unlikely that they care
> >>>> about the number and _don't_ care if it actually gets realized.
> >>>>
> >>>> But then, it still seems like we can just keep the option as part of
> >>>> Grouped. As in:
> >>>>
> >>>> // user does not care
> >>>> stream.groupByKey(Grouped /*not specifying partition count*/)
> >>>> stream.groupBy(Grouped /*not specifying partition count*/)
> >>>>
> >>>> // user does care
> >>>> stream.repartition(Repartitioned)
> >>>> stream.groupByKey(Grouped.numberOfPartitions(...))
> >>>> stream.groupBy(Grouped.numberOfPartitions(...))
> >>>>
> >>>> ----
> >>>>
> >>>> The above discussion got me thinking about algebra. Matthias is
> >>>> absolutely right that `groupByKey(numPartitions)` is equivalent to
> >>>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
> >>>> we should force people to apply that expansion themselves vs. having a
> >>>> more compact way to express it if they don't care where exactly the
> >>>> repartition occurs. However, thinking about these operators
> >>>> algebraically can really help *us* narrow down the number of different
> >>>> expressions we have to consider.
> >>>>
> >>>> Let's consider some identities:
> >>>>
> >>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
> >>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
> >>>> C: mapKey(mapper) + ... + groupByKey + agg
> >>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
> >>>> D: repartition = sink(managed) + src
> >>>>
> >>>> In these identities, I used one special identifier (...), which means
> >>>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
> >>>> repartition, or agg.
> >>>>
> >>>> For mental clarity, I'm just going to make up a rule that groupBy
> >>>> operations are not executable. In other words, you have to get to a
> >>>> point where you can apply B to convert a groupByKey into a passthough
> >>>> in order to execute the program. This is just a formal way of stating
> >>>> what already happens in Kafka Streams.
> >>>>
> >>>> By applying A, we can just completely leave `groupBy` out of our
> >>>> analysis. It trivially decomposes into a mapKey followed by a
> >>>> groupByKey.
> >>>>
> >>>> Then, we can eliminate the "repartition required" case of `groupByKey`
> >>>> by applying C followed by D to get to the "no repartition required"
> >>>> version of groupByKey, which in turn sets us up to apply B to get an
> >>>> executable topology.
> >>>>
> >>>> Fundamentally, you can think about KIP-221 is as proposing a modified
> >>>> D identity in which you can specify the partition count of the managed
> >>>> sink topic:
> >>>> D': repartition(pc) = sink(managed w/ pc) + src
> >>>>
> >>>> Since users _could_ apply the identities above, we don't actually have
> >>>> to add any partition count to groupBy[Key], but we decided early on in
> >>>> the KIP discussion that it's more ergonomic to add it. In that case,
> >>>> we also have to modify A and C:
> >>>> A': groupBy(mapper, pc) + agg
> >>>> = mapKey(mapper) + groupByKey(pc) + agg
> >>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
> >>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
> >>>>
> >>>> Which sets us up still to always be able to get back to a plain
> >>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
> >>>> ultimately B to get an executable topology.
> >>>>
> >>>> What about the optimizer?
> >>>> The optimizer applies another set of graph-algebraic identities to
> >>>> minimize the number of repartition topics in a topology.
> >>>>
> >>>> (forgive my ascii art)
> >>>>
> >>>> E: (merging repartition nodes)
> >>>> (...) -> repartition -> X
> >>>>  \-> repartition -> Y
> >>>> =
> >>>> (... + repartition) -> X
> >>>>     \-> Y
> >>>> F: (reordering around repartition)
> >>>> Where SVO is any non-key-changing, stateless, operation:
> >>>> repartition -> SVO = SVO -> repartition
> >>>>
> >>>> In terms of these identities, what the optimizer does is apply F
> >>>> repeatedly in either direction to a topology to factor out common in
> >>>> branches so that it can apply E to merge repartition nodes. This was
> >>>> especially necessary before KIP-221 because you couldn't directly
> >>>> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
> >>>> so there was no way to do the factoring manually.
> >>>>
> >>>> We can now state very clearly that in KIP-221, explicit
> >>>> `repartition()` operators should create a "reordering barrier". So, F
> >>>> cannot be applied to an explicit `repartition()`. Also, I think we
> >>>> decided earlier that explicit `repartition()` operations would also be
> >>>> ineligible for merging, so E can't be applied to explicit
> >>>> `repartition()` operations either. I think we feel we _could_ apply E
> >>>> without harm, but we want to be conservative for now.
> >>>>
> >>>> I think the salient point from the latter discussion has been that
> >>>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
> >>>> an explicit `repartition()` operator, and therefore, the resulting
> >>>> repartition node remains eligible for optimization.
> >>>>
> >>>> To be clear, I agree with Matthias that the provided partition count
> >>>> _must_ be used in the resulting implicit repartition. This has some
> >>>> implications for E. Namely, E could only be applied to two repartition
> >>>> nodes that have the same partition count. This has always been
> >>>> trivially true before KIP-221 because the partition count has always
> >>>> been "unspecified", i.e., it would be determined at runtime by the
> >>>> user-managed-topics' partition counts. Now, it could be specified or
> >>>> unspecified. We can simply augment E to allow merging only repartition
> >>>> nodes where the partition count is EITHER "specified and the same on
> >>>> both sides", OR "unspecified on both sides".
> >>>>
> >>>> Sorry for the long email, but I have a hope that it builds a solid
> >>>> theoretical foundation for our decisions in KIP-221, so we can have
> >>>> confidence that there are no edge cases for design flaws to hide.
> >>>>
> >>>> Thanks,
> >>>> -John
> >>>>
> >>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <
> matthias@confluent.io> wrote:
> >>>>>
> >>>>>> it seems like we do want to allow
> >>>>>>> people to optionally specify a partition count as part of this
> >>>>>>> operation, but we don't want that option to _force_ repartitioning
> >>>>>
> >>>>> Correct, ie, that is my suggestions.
> >>>>>
> >>>>>> "Use P partitions if repartitioning is necessary"
> >>>>>
> >>>>> I disagree here, because my reasoning is that:
> >>>>>
> >>>>> - if a user cares about the number of partition, the user wants to
> >>>>> enforce a repartitioning
> >>>>> - if a user does not case about the number of partitions, we don't
> need
> >>>>> to provide them a way to pass in a "hint"
> >>>>>
> >>>>> Hence, it should be sufficient to support:
> >>>>>
> >>>>> // user does not care
> >>>>>
> >>>>>  `stream.groupByKey(Grouped)`
> >>>>>  `stream.grouBy(..., Grouped)`
> >>>>>
> >>>>> // user does care
> >>>>>
> >>>>>  `stream.repartition(Repartitioned).groupByKey()`
> >>>>>  `streams.groupBy(..., Repartitioned)`
> >>>>>
> >>>>>
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>>
> >>>>> On 11/9/19 8:10 PM, John Roesler wrote:
> >>>>>> Thanks for those thoughts, Matthias,
> >>>>>>
> >>>>>> I find your reasoning about the optimization behavior compelling.
> The
> >>>>>> `through` operation is very simple and clear to reason about. It
> just
> >>>>>> passes the data exactly at the specified point in the topology
> exactly
> >>>>>> through the specified topic. Likewise, if a user invokes a
> >>>>>> `repartition` operator, the simplest behavior is if we just do what
> >>>>>> they asked for.
> >>>>>>
> >>>>>> Stepping back to think about when optimizations are surprising and
> >>>>>> when they aren't, it occurs to me that we should be free to move
> >>>>>> around repartitions when users have asked to perform some operation
> >>>>>> that implies a repartition, like "change keys, then filter, then
> >>>>>> aggregate". This program requires a repartition, but it could be
> >>>>>> anywhere between the key change and the aggregation. On the other
> >>>>>> hand, if they say, "change keys, then filter, then repartition, then
> >>>>>> aggregate", it seems like they were pretty clear about their desire,
> >>>>>> and we should just take it at face value.
> >>>>>>
> >>>>>> So, I'm sold on just literally doing a repartition every time they
> >>>>>> invoke the `repartition` operator.
> >>>>>>
> >>>>>>
> >>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is more
> nuanced.
> >>>>>>
> >>>>>> What you said about `groupByKey` makes sense. If they key hasn't
> >>>>>> actually changed, then we don't need to repartition before
> >>>>>> aggregating. On the other hand, `groupBy` is specifically changing
> the
> >>>>>> key as part of the grouping operation, so (as you said) we
> definitely
> >>>>>> have to do a repartition.
> >>>>>>
> >>>>>> If I'm reading the discussion right, it seems like we do want to
> allow
> >>>>>> people to optionally specify a partition count as part of this
> >>>>>> operation, but we don't want that option to _force_ repartitioning
> if
> >>>>>> it's not needed. That last clause is the key. "Use P partitions if
> >>>>>> repartitioning is necessary" is a directive that applies cleanly and
> >>>>>> correctly to both `groupBy` and `groupByKey`. What if we call the
> >>>>>> option `numberOfPartitionsHint`, which along with the "if necessary"
> >>>>>> javadoc, should make it clear that the option won't force a
> >>>>>> repartition, and also gives us enough latitude to still employ the
> >>>>>> optimizer on those repartition topics?
> >>>>>>
> >>>>>> If we like the idea of expressing it as a "hint" for grouping and a
> >>>>>> "command" for `repartition`, then it seems like it still makes sense
> >>>>>> to keep Grouped and Repartitioned separate, as they would actually
> >>>>>> offer different methods with distinct semantics.
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -John
> >>>>>>
> >>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <
> matthias@confluent.io> wrote:
> >>>>>>>
> >>>>>>> Sorry for late reply.
> >>>>>>>
> >>>>>>> I guess, the question boils down to the intended semantics of
> >>>>>>> `repartition()`. My understanding is as follows:
> >>>>>>>
> >>>>>>> - KS does auto-repartitioning for correctness reasons (using the
> >>>>>>> upstream topic to determine the number of partitions)
> >>>>>>> - KS does auto-repartitioning only for downstream DSL operators
> like
> >>>>>>> `count()` (eg, a `transform()` does never trigger an
> auto-repartitioning
> >>>>>>> even if the stream is marked as `repartitioningRequired`).
> >>>>>>> - KS offers `through()` to enforce a repartitioning -- however,
> the user
> >>>>>>> needs to create the topic manually (with the desired number of
> partitions).
> >>>>>>>
> >>>>>>> I see two main applications for `repartitioning()`:
> >>>>>>>
> >>>>>>> 1) repartition data before a `transform()` but user does not want
> to
> >>>>>>> manage the topic
> >>>>>>> 2) scale out a downstream subtopology
> >>>>>>>
> >>>>>>> Hence, I see `repartition()` similar to `through()`: if a users
> calls
> >>>>>>> it, a repartitining is enforced, with the difference that KS
> manages the
> >>>>>>> topic and the user does not need to create it.
> >>>>>>>
> >>>>>>> This behavior makes (1) and (2) possible.
> >>>>>>>
> >>>>>>>> I think many users would prefer to just say "if there *is* a
> repartition
> >>>>>>>> required at this point in the topology, it should
> >>>>>>>> have N partitions"
> >>>>>>>
> >>>>>>> Because of (2), I disagree. Either a user does not care about
> scaling
> >>>>>>> out, for which case she would not specify the number of
> partitions. Or a
> >>>>>>> user does care, and hence wants to enforce the scale out. I don't
> think
> >>>>>>> that any user would say, "maybe scale out".
> >>>>>>>
> >>>>>>> Therefore, the optimizer should never ignore the repartition
> operation.
> >>>>>>> As a "consequence" (because repartitioning is expensive) a user
> should
> >>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
> >>>>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
> >>>>>>> because it might be too subtle and an "optional scaling out" as
> laid out
> >>>>>>> above does not make sense IMHO.
> >>>>>>>
> >>>>>>> I am also not worried about "over repartitioning" because the
> result
> >>>>>>> stream would never trigger auto-repartitioning. Only if multiple
> >>>>>>> consecutive calls to `repartition()` are made it could be bad --
> but
> >>>>>>> that's the same with `through()`. In the end, there is always some
> >>>>>>> responsibility on the user.
> >>>>>>>
> >>>>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
> >>>>>>> however, for `groupByKey()` it depends if the KStream is marked as
> >>>>>>> `repartitioningRequired`.
> >>>>>>>
> >>>>>>> Hence, for `groupByKey()` it should not be possible for a user to
> set
> >>>>>>> number of partitions IMHO. For `groupBy()` it's a different story,
> >>>>>>> because calling
> >>>>>>>
> >>>>>>>   `repartition().groupBy()`
> >>>>>>>
> >>>>>>> does not achieve what we want. Hence, allowing users to pass in the
> >>>>>>> number of users partitions into `groupBy()` does actually makes
> sense,
> >>>>>>> because repartitioning will happen anyway and thus we can
> piggyback a
> >>>>>>> scaling decision.
> >>>>>>>
> >>>>>>> I think that John has a fair concern about the overloads, however,
> I am
> >>>>>>> not convinced that using `Grouped` to specify the number of
> partitions
> >>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and
> both
> >>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
> >>>>>>> wondering if we could bridge the gap between both, if we would make
> >>>>>>> `Repartitioned extends Grouped`? For this case, we only need
> >>>>>>> `groupBy(Grouped)` and a user can pass in both types what seems to
> make
> >>>>>>> the API quite smooth:
> >>>>>>>
> >>>>>>>  `stream.groupBy(..., Grouped...)`
> >>>>>>>
> >>>>>>>  `stream.groupBy(..., Repartitioned...)`
> >>>>>>>
> >>>>>>>
> >>>>>>> Thoughts?
> >>>>>>>
> >>>>>>>
> >>>>>>> -Matthias
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> >>>>>>>> Hi Sophie,
> >>>>>>>>
> >>>>>>>> Thank you for your reply, very insightful. Looking forward
> hearing others opinion as well on this.
> >>>>>>>>
> >>>>>>>> Kind regards,
> >>>>>>>> Levani
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <
> sophie@confluent.io> wrote:
> >>>>>>>>>
> >>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> other hand
> >>>>>>>>> Kafka Streams has already
> >>>>>>>>>> optimizer in place which alters topology independently from user
> >>>>>>>>>
> >>>>>>>>> I agree (with you) and think this is a good way to put it -- we
> currently
> >>>>>>>>> auto-repartition for the user so
> >>>>>>>>> that they don't have to walk through their entire topology and
> reason about
> >>>>>>>>> when and where to place a
> >>>>>>>>> `.through` (or the new `.repartition`), so why suddenly force
> this onto the
> >>>>>>>>> user? How certain are we that
> >>>>>>>>> users will always get this right? It's easy to imagine that
> during
> >>>>>>>>> development, you write your new app with
> >>>>>>>>> correctly placed repartitions in order to use this new feature.
> During the
> >>>>>>>>> course of development you end up
> >>>>>>>>> tweaking the topology, but don't remember to review or move the
> >>>>>>>>> repartitioning since you're used to Streams
> >>>>>>>>> doing this for you. If you use only single-partition topics for
> testing,
> >>>>>>>>> you might not even notice your app is
> >>>>>>>>> spitting out incorrect results!
> >>>>>>>>>
> >>>>>>>>> Anyways, I feel pretty strongly that it would be weird to
> introduce a new
> >>>>>>>>> feature and say that to use it, you can't take
> >>>>>>>>> advantage of this other feature anymore. Also, is it possible our
> >>>>>>>>> optimization framework could ever include an
> >>>>>>>>> optimized repartitioning strategy that is better than what a
> user could
> >>>>>>>>> achieve by manually inserting repartitions?
> >>>>>>>>> Do we expect users to have a deep understanding of the best way
> to
> >>>>>>>>> repartition their particular topology, or is it
> >>>>>>>>> likely they will end up over-repartitioning either due to missed
> >>>>>>>>> optimizations or unnecessary extra repartitions?
> >>>>>>>>> I think many users would prefer to just say "if there *is* a
> repartition
> >>>>>>>>> required at this point in the topology, it should
> >>>>>>>>> have N partitions"
> >>>>>>>>>
> >>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather
> than
> >>>>>>>>> adding a new parameter to groupBy, that does seem more in line
> with the
> >>>>>>>>> current syntax so +1 from me
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <
> levani.codes@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello all,
> >>>>>>>>>>
> >>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and
> it’s
> >>>>>>>>>> almost done, I want to resurrect discussion about this KIP to
> address
> >>>>>>>>>> couple of concerns raised by Matthias and John.
> >>>>>>>>>>
> >>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users
> control over
> >>>>>>>>>> repartitioning and parallelism of sub-topologies by:
> >>>>>>>>>> 1) Introducing new KStream#repartition operation which is done
> in
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170>
> >>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is
> planned to
> >>>>>>>>>> be separate PR.
> >>>>>>>>>>
> >>>>>>>>>> While all agree about general implementation and idea behind
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
> >>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some
> questions during
> >>>>>>>>>> the review.
> >>>>>>>>>> Matthias raised concern that there can be cases when user uses
> >>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual
> repartitioning may
> >>>>>>>>>> not required, thus configuration passed via `Repartitioned`
> would never be
> >>>>>>>>>> applied (Matthias, please correct me if I misinterpreted your
> comment).
> >>>>>>>>>> So instead, if user wants to control parallelism of
> sub-topologies, he or
> >>>>>>>>>> she should always use `KStream#repartition` operation before
> groupBy. Full
> >>>>>>>>>> comment can be seen here:
> >>>>>>>>>>
> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> >>>>>>>>>>
> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> >>>>>>>>>>
> >>>>>>>>>> On the same topic, John pointed out that, from API design
> perspective, we
> >>>>>>>>>> shouldn’t intertwine configuration classes of different
> operators between
> >>>>>>>>>> one another. So instead of introducing new
> `KStream#groupBy(Repartitioned)`
> >>>>>>>>>> for specifying number of partitions for internal topic, we
> should update
> >>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
> >>>>>>>>>>
> >>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the
> other hand
> >>>>>>>>>> Kafka Streams has already optimizer in place which alters
> topology
> >>>>>>>>>> independently from user. So maybe it makes sense if Kafka
> Streams,
> >>>>>>>>>> internally would optimize topology in the best way possible,
> even if in
> >>>>>>>>>> some cases this means ignoring some operator configurations
> passed by the
> >>>>>>>>>> user. Also, I agree with John about API design semantics. If we
> go through
> >>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes more
> sense to
> >>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of
> introducing
> >>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
> >>>>>>>>>>
> >>>>>>>>>> I would really appreciate communities feedback on this.
> >>>>>>>>>>
> >>>>>>>>>> Kind regards,
> >>>>>>>>>> Levani
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <
> sophie@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>
> >>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and
> don't have
> >>>>>>>>>> much
> >>>>>>>>>>> spare time at the
> >>>>>>>>>>> moment. It's kind of a difficult time to get attention on
> things, but
> >>>>>>>>>> feel
> >>>>>>>>>>> free to pick up something else
> >>>>>>>>>>> to work on in the meantime until things have calmed down a bit!
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Sophie
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sorry for bringing this thread again, but I would like to get
> some
> >>>>>>>>>>>> attention on this PR:
> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
> >>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
> >>>>>>>>>>>> It's been a while now and I would love to move on to other
> KIPs as well.
> >>>>>>>>>>>> Please let me know if you have any concerns.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Levani
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here’s voting thread for this KIP:
> >>>>>>>>>>>>
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
> <
> >>>>>>>>>>>>
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html
> >>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on
> that one.
> >>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good
> idea.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Updated KIP
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <
> matthias@confluent.io
> >>>>>>>>>> <ma...@confluent.io>
> >>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>>
> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One question:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Why do we add
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int
> numberOfPartitions)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It seems that `#with(String name)`,
> `#numberOfPartitions(int)` in
> >>>>>>>>>>>>>>> combination with `withName()` and
> `withNumberOfPartitions()` should
> >>>>>>>>>> be
> >>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload
> small if
> >>>>>>>>>>>> possible.)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Otherwise LGTM.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks all for your feedback.
> >>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any
> other
> >>>>>>>>>>>> concerns about this KIP, please let me know.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> >>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
> >>>>>>>>>>>>>>>>> I’ve updated KIP (
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>>>> ).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> >>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
> >>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>
> <mailto:
> >>>>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>>
> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a
> partitioning
> >>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and
> producer
> >>>>>>>>>>>> configs. My
> >>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an
> "extension" to
> >>>>>>>>>>>> `Produced`,
> >>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think
> about it and
> >>>>>>>>>>>> helps
> >>>>>>>>>>>>>>>>>> to keep the API "clean".
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
> >>>>>>>>>> abbreviations.
> >>>>>>>>>>>> Can
> >>>>>>>>>>>>>>>>>> we rename:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static`
> methods:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes
> sense to
> >>>>>>>>>> have
> >>>>>>>>>>>> some duplicate configurations between operators in order to
> keep API
> >>>>>>>>>> simple
> >>>>>>>>>>>> and usable.
> >>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have
> proper
> >>>>>>>>>>>> backward compatibility.
> >>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs
> separate was
> >>>>>>>>>>>> exciting, having Repartitioned class encapsulate some
> producer level
> >>>>>>>>>>>> configs makes API more readable.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> >>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep
> producer level
> >>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level
> considerations
> >>>>>>>>>>>> separate.
> >>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic
> level
> >>>>>>>>>>>> configuration. But
> >>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as
> much a topic
> >>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two
> producers
> >>>>>>>>>>>> configured
> >>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if
> they are
> >>>>>>>>>>>> writing to the
> >>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to
> part. So in a
> >>>>>>>>>>>> sense, these
> >>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just
> producers.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is
> not always
> >>>>>>>>>>>> true to
> >>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an
> operator are
> >>>>>>>>>>>> wrapped into a
> >>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could
> instead
> >>>>>>>>>>>> split out the
> >>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate
> parameter to keep
> >>>>>>>>>>>> topic and
> >>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this
> increases the
> >>>>>>>>>>>> API surface
> >>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just
> say "this
> >>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> everything that this particular operator needs"
> without worrying
> >>>>>>>>>>>> about what
> >>>>>>>>>>>>>>>>>>>> exactly you're specifying.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a
> field of
> >>>>>>>>>>>> Repartitioned,
> >>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition
> elsewhere in
> >>>>>>>>>>>> Streams at
> >>>>>>>>>>>>>>>>>>>> the moment
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> >>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> <mailto:
> >>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> >>>>>>>>>>>> `Repartitioned#partitioner`
> >>>>>>>>>>>>>>>>>>>>> configuration.
> >>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for
> topic level
> >>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
> >>>>>>>>>>>> configurations (such
> >>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this
> thread).
> >>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream
> interface, I
> >>>>>>>>>>>> couldn’t really
> >>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level
> configuration
> >>>>>>>>>>>> class and just
> >>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking
> the
> >>>>>>>>>>>> semantics.
> >>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which
> encapsulates topic
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is
> viable
> >>>>>>>>>>>> thing to do.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <
> bbejeck@gmail.com
> >>>>>>>>>> <ma...@gmail.com>
> >>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>>
> <mailto:
> >>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In
> addition to
> >>>>>>>>>>>> the reason
> >>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than
> hash-based
> >>>>>>>>>>>> partitioning
> >>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something
> different than the
> >>>>>>>>>>>> key
> >>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in
> conjunction with
> >>>>>>>>>>>> the key
> >>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
> >>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on
> Produced hence
> >>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense
> for API
> >>>>>>>>>>>> consistency.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> <mailto:
> >>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to
> Repartitioned class,
> >>>>>>>>>>>> with the
> >>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for
> internal
> >>>>>>>>>>>> topics, user
> >>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning
> before join
> >>>>>>>>>>>> operation.
> >>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> >>>>>>>>>> john@confluent.io <ma...@confluent.io>
> >>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>
> <mailto:
> >>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again,
> not totally
> >>>>>>>>>>>> sure it
> >>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is
> the
> >>>>>>>>>> rationale
> >>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example
> I would
> >>>>>>>>>>>> first issue
> >>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner
> that
> >>>>>>>>>>>> essentially
> >>>>>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and
> then do the
> >>>>>>>>>>>>>>>>>>>>> KStream#join
> >>>>>>>>>>>>>>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> >>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> john@confluent.io
> >>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
> >>>>>>>>>> john@confluent.io>
> >>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to
> specify
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that
> some data may
> >>>>>>>>>>>> get
> >>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input
> topic. If the
> >>>>>>>>>>>> right-side
> >>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning
> strategy, then the
> >>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be
> partitioned with the
> >>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss
> something
> >>>>>>>>>>>> important?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani
> Kokhreidze
> >>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be
> honest I’m
> >>>>>>>>>> not
> >>>>>>>>>>>> sure
> >>>>>>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really
> think I would
> >>>>>>>>>>>> need control
> >>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams
> knows best
> >>>>>>>>>>>> how to
> >>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be
> used only for
> >>>>>>>>>>>> topics
> >>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have
> possibility to
> >>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I
> guess
> >>>>>>>>>>>> specifying the
> >>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s
> agreement on
> >>>>>>>>>> this.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie
> Blee-Goldman <
> >>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that
> Repartitioned
> >>>>>>>>>>>> would be a
> >>>>>>>>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need
> to have
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar
> to
> >>>>>>>>>>>> Produced? I'm not
> >>>>>>>>>>>>>>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems
> it should
> >>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani
> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> >>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka
> Streams.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more
> control over
> >>>>>>>>>>>> the topic
> >>>>>>>>>>>>>>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is
> something that
> >>>>>>>>>> is
> >>>>>>>>>>>> missing
> >>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over
> parallelism by
> >>>>>>>>>>>> configuring
> >>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie
> Blee-Goldman <
> >>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>
> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one
> thing for me
> >>>>>>>>>> --
> >>>>>>>>>>>> for the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a
> Repartitioned,
> >>>>>>>>>>>> will the
> >>>>>>>>>>>>>>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be
> the case
> >>>>>>>>>> for
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to
> be
> >>>>>>>>>>>> pre-created? The
> >>>>>>>>>>>>>>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of
> the method
> >>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani
> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>
> >>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>
> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>> <
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>> <
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities
> opinions/concerns
> >>>>>>>>>> about
> >>>>>>>>>>>> this KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani
> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani
> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve
> updated
> >>>>>>>>>>>> confluence
> >>>>>>>>>>>>>>>>>>>>> page
> >>>>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives”
> section.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani
> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing
> dedicated `Topic`
> >>>>>>>>>>>> class for
> >>>>>>>>>>>>>>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
> >>>>>>>>>> `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion
> about this
> >>>>>>>>>> as
> >>>>>>>>>>>> well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias
> J. Sax <
> >>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And
> thanks for
> >>>>>>>>>>>> summarizing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been
> discussed
> >>>>>>>>>>>> already (can't
> >>>>>>>>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good
> summary to
> >>>>>>>>>>>> refresh the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense.
> With regard
> >>>>>>>>>> to
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics
> and
> >>>>>>>>>> operators
> >>>>>>>>>>>> that use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it
> might
> >>>>>>>>>> indicate
> >>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that
> uses
> >>>>>>>>>>>> use-defined
> >>>>>>>>>>>>>>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()`
> operator (an
> >>>>>>>>>>>> operator that
> >>>>>>>>>>>>>>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good.
> Otherwise, there is
> >>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages
> topics but
> >>>>>>>>>>>> sometimes
> >>>>>>>>>>>>>>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the
> distinction
> >>>>>>>>>>>> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to
> `Grouped`. I am
> >>>>>>>>>>>> wondering
> >>>>>>>>>>>>>>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply
> and adding
> >>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get
> opinions of
> >>>>>>>>>>>> others on
> >>>>>>>>>>>>>>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer
> atm.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects
> that allow
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> configure
> >>>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer,
> store. If we
> >>>>>>>>>>>> assume that
> >>>>>>>>>>>>>>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we
> should have a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class
> and method
> >>>>>>>>>>>> instead of
> >>>>>>>>>>>>>>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add
> other
> >>>>>>>>>> configs,
> >>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily,
> without the
> >>>>>>>>>>>> need to
> >>>>>>>>>>>>>>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure
> if I like
> >>>>>>>>>>>> them
> >>>>>>>>>>>>>>>>>>>>> myself.
> >>>>>>>>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze
> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though -
> maybe
> >>>>>>>>>> enhancing
> >>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the
> best approach.
> >>>>>>>>>>>> Let me
> >>>>>>>>>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with
> this
> >>>>>>>>>>>> configuration,
> >>>>>>>>>>>>>>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since
> KStream#to is
> >>>>>>>>>>>> the final
> >>>>>>>>>>>>>>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
> >>>>>>>>>> assumption
> >>>>>>>>>>>> that user
> >>>>>>>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And
> in that
> >>>>>>>>>>>> case, having
> >>>>>>>>>>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much
> sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based
> on API
> >>>>>>>>>>>> contract, seems
> >>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that
> is
> >>>>>>>>>>>> explicitly for
> >>>>>>>>>>>>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner
> those all
> >>>>>>>>>>>> are producer
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is
> topic level
> >>>>>>>>>>>>>>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> >>>>>>>>>>>> configurations
> >>>>>>>>>>>>>>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface,
> seems like
> >>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal
> (e.g
> >>>>>>>>>> topics
> >>>>>>>>>>>> created
> >>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I
> think we
> >>>>>>>>>>>> should leave
> >>>>>>>>>>>>>>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account,
> I think we
> >>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka
> Streams should
> >>>>>>>>>>>> create and
> >>>>>>>>>>>>>>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs
> topics that
> >>>>>>>>>>>> should be
> >>>>>>>>>>>>>>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding
> numPartitions
> >>>>>>>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer
> level
> >>>>>>>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API
> semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional
> in
> >>>>>>>>>>>> KStream#through - I
> >>>>>>>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving
> users
> >>>>>>>>>>>> possibility to
> >>>>>>>>>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
> >>>>>>>>>> Considering
> >>>>>>>>>>>> arguments
> >>>>>>>>>>>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced
> class, I see two
> >>>>>>>>>>>> options
> >>>>>>>>>>>>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> auto-generated and
> >>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
> - topic
> >>>>>>>>>> will
> >>>>>>>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
> final
> >>>>>>>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
> with
> >>>>>>>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and
> introduce
> >>>>>>>>>>>> new method
> >>>>>>>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias
> suggested this
> >>>>>>>>>>>> in one of
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I
> propose the
> >>>>>>>>>>>> following
> >>>>>>>>>>>>>>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> configuration to
> >>>>>>>>>> Grouped
> >>>>>>>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> >>>>>>>>>>>> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be
> auto-generated and
> >>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions)
> - topic
> >>>>>>>>>> will
> >>>>>>>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions,
> final
> >>>>>>>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated
> with
> >>>>>>>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced
> parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions
> configuration to
> >>>>>>>>>> Grouped
> >>>>>>>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator
> KStream#repartition for
> >>>>>>>>>>>> creating and
> >>>>>>>>>>>>>>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was
> already
> >>>>>>>>>>>> discussed in the
> >>>>>>>>>>>>>>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads
> that were
> >>>>>>>>>>>> about this
> >>>>>>>>>>>>>>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani
> Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>>>>>>>>> levani.codes@gmail.com>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion
> around
> >>>>>>>>>>>> KIP-221. Going
> >>>>>>>>>>>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to
> agreement
> >>>>>>>>>>>> around
> >>>>>>>>>>>>>>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far
> as I
> >>>>>>>>>>>> understood, the
> >>>>>>>>>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the
> following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to
> KStream#through
> >>>>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with
> numOfPartitions
> >>>>>>>>>>>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL
> users to
> >>>>>>>>>> control
> >>>>>>>>>>>>>>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
> >>>>>>>>>> operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface
> changes around
> >>>>>>>>>>>>>>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Side remark:

If the user specifies `repartition()` on both side of the join, we can
actually throw the execption earlier, ie, when we build the topology.

Current, we can do this check only after Kafka Streams was started,
within `StreamPartitionAssignor#assign()` -- we still need to keep this
check for the case that none or only one side has a user specified
number of partitions though.


-Matthias

On 11/14/19 8:15 AM, John Roesler wrote:
> Thanks, all,
> 
> I can get behind just totally leaving out reparation-via-groupBy. If
> we only introduce `repartition()` for now, we're making the minimal
> change to gain the desired capability.
> 
> Plus, since we agree that `repartition()` should never be optimizable,
> it's a future-compatible proposal. I.e., if we were to add a
> non-optimizable groupBy(partitions) operation now, and want to make it
> optimizable in the future, we have to worry about topology
> compatibility. Better to just do non-optimizable `repartition()` now,
> and add an optimizable `groupBy(partitions)` in the future (maybe).
> 
> About joins, yes, it's a concern, and IMO we should just do the same
> thing we do now... check at runtime that the partition counts on both
> sides match and throw an exception otherwise. What this means as a
> user is that if you explicitly repartition the left side to 100
> partitions, and then join with the right side at 10 partitions, you
> get an exception, since this operation is not possible. You'd either
> have to "step down" the left side again, back to 10 partitions, or you
> could repartition the right side to 100 partitions before the join.
> The choice has to be the user's, since it depends on their desired
> execution parallelism.
> 
> Thanks,
> -John
> 
> On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <ma...@confluent.io> wrote:
>>
>> Thanks a lot John. I think the way you decompose the operators is super
>> helpful for this discussion.
>>
>> What you suggest with regard to using `Grouped` and enforcing
>> repartitioning if the number of partitions is specified is certainly
>> possible. However, I am not sure if we _should_ do this. My reasoning is
>> that an enforce repartitioning as introduced via `repartition()` is an
>> expensive operations, and it seems better to demand an more explicit
>> user opt-in to trigger it. Just setting an optional parameter might be
>> too subtle to trigger such a heavy "side effect".
>>
>> While I agree about "usability" in general, I would prefer a more
>> conservative appraoch to introduce this feature, see how it goes, and
>> maybe make it more advance later on. This also applies to what
>> optimzation we may or may not allow (or are able to perform at all).
>>
>> @Levani: Reflecting about my suggestion about `Repartioned extends
>> Grouped`, I agree that it might not be a good idea.
>>
>> Atm, I see an enforces repartitioning as non-optimizable and as a good
>> first step and I would suggest to not intoruce anything else for now.
>> Introducing optimizable enforce repartitioning via `groupBy(...,
>> Grouped)` is something we could add later.
>>
>>
>> Therefore, I would not change `Grouped` but only introduce
>> `repartition()`. Users that use `grouBy()` atm, and want to opt-in to
>> set the number of partitions, would need to rewrite their code to
>> `selectKey(...).repartition(...).groupByKey()`. It's less convinient but
>> also less risky from an API and optimization point of view.
>>
>>
>> @Levani: about joins -> yes, we will need to check the specified number
>> of partitions (if any) and if they don't match, throw an exception. We
>> can discuss this on the PR -- I am just trying to get the PR for KIP-466
>> merged -- your is next on the list :)
>>
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>>
>> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
>>> Thank you all for an interesting discussion. This is very enlightening.
>>>
>>> Thank you Matthias for your explanation. Your arguments are very true. It makes sense that if user specifies number of partitions he/she really cares that those specifications are applied to internal topics.
>>> Unfortunately, in current implementation this is not true during `join` operation. As I’ve written in the PR comment, currently, when `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of partitions from the two source topics.
>>> I’m not really sure what would be the other way around this situation. Maybe fail the stream altogether and inform the user to specify same number of partitions?
>>> Or we should treat join operations in a same way as it is right now and basically choose max number of partitions even when `repartition` operation is specified, because Kafka Streams “knows the best” how to handle joins?
>>> You can check integration tests how it’s being handled currently. Open to suggestions on that part.
>>>
>>> As for groupBy, I agree and John raised very interesting points. My arguments for allowing users to specify number of partitions during groupBy operations mainly was coming from the usability perspective.
>>> So building on top of what John said, maybe it makes sense to make `groupBy` operations smarter and whenever user specifies `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
>>> I’m not going into optimization part yet :) I think it will be part of separate PR and task, but overall it makes sense to apply optimizations where number of partitions are the same.
>>>
>>> As for Repartitioned extending Grouped, I kinda feel that it won’t fit nicely in current API design.
>>> In addition, in the PR review, John mentioned that there were a lot of troubles in the past trying to use one operation's configuration objects on other operations.
>>> Also it makes sense to keep them separate in terms of compatibility.
>>> In that case, we don’t have to worry every time Grouped is changed, what would be the implications on `repartition` operations.
>>>
>>> Kind regards,
>>> Levani
>>>
>>>
>>>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
>>>>
>>>> Ah, thanks for the clarification. I missed your point.
>>>>
>>>> I like the framework you've presented. It does seem simpler to assume
>>>> that they either care about the partition count and want to
>>>> repartition to realize it, or they don't care about the number.
>>>> Returning to this discussion, it does seem unlikely that they care
>>>> about the number and _don't_ care if it actually gets realized.
>>>>
>>>> But then, it still seems like we can just keep the option as part of
>>>> Grouped. As in:
>>>>
>>>> // user does not care
>>>> stream.groupByKey(Grouped /*not specifying partition count*/)
>>>> stream.groupBy(Grouped /*not specifying partition count*/)
>>>>
>>>> // user does care
>>>> stream.repartition(Repartitioned)
>>>> stream.groupByKey(Grouped.numberOfPartitions(...))
>>>> stream.groupBy(Grouped.numberOfPartitions(...))
>>>>
>>>> ----
>>>>
>>>> The above discussion got me thinking about algebra. Matthias is
>>>> absolutely right that `groupByKey(numPartitions)` is equivalent to
>>>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
>>>> we should force people to apply that expansion themselves vs. having a
>>>> more compact way to express it if they don't care where exactly the
>>>> repartition occurs. However, thinking about these operators
>>>> algebraically can really help *us* narrow down the number of different
>>>> expressions we have to consider.
>>>>
>>>> Let's consider some identities:
>>>>
>>>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>>>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>>>> C: mapKey(mapper) + ... + groupByKey + agg
>>>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>>>> D: repartition = sink(managed) + src
>>>>
>>>> In these identities, I used one special identifier (...), which means
>>>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
>>>> repartition, or agg.
>>>>
>>>> For mental clarity, I'm just going to make up a rule that groupBy
>>>> operations are not executable. In other words, you have to get to a
>>>> point where you can apply B to convert a groupByKey into a passthough
>>>> in order to execute the program. This is just a formal way of stating
>>>> what already happens in Kafka Streams.
>>>>
>>>> By applying A, we can just completely leave `groupBy` out of our
>>>> analysis. It trivially decomposes into a mapKey followed by a
>>>> groupByKey.
>>>>
>>>> Then, we can eliminate the "repartition required" case of `groupByKey`
>>>> by applying C followed by D to get to the "no repartition required"
>>>> version of groupByKey, which in turn sets us up to apply B to get an
>>>> executable topology.
>>>>
>>>> Fundamentally, you can think about KIP-221 is as proposing a modified
>>>> D identity in which you can specify the partition count of the managed
>>>> sink topic:
>>>> D': repartition(pc) = sink(managed w/ pc) + src
>>>>
>>>> Since users _could_ apply the identities above, we don't actually have
>>>> to add any partition count to groupBy[Key], but we decided early on in
>>>> the KIP discussion that it's more ergonomic to add it. In that case,
>>>> we also have to modify A and C:
>>>> A': groupBy(mapper, pc) + agg
>>>> = mapKey(mapper) + groupByKey(pc) + agg
>>>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>>>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>>>
>>>> Which sets us up still to always be able to get back to a plain
>>>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>>>> ultimately B to get an executable topology.
>>>>
>>>> What about the optimizer?
>>>> The optimizer applies another set of graph-algebraic identities to
>>>> minimize the number of repartition topics in a topology.
>>>>
>>>> (forgive my ascii art)
>>>>
>>>> E: (merging repartition nodes)
>>>> (...) -> repartition -> X
>>>>  \-> repartition -> Y
>>>> =
>>>> (... + repartition) -> X
>>>>     \-> Y
>>>> F: (reordering around repartition)
>>>> Where SVO is any non-key-changing, stateless, operation:
>>>> repartition -> SVO = SVO -> repartition
>>>>
>>>> In terms of these identities, what the optimizer does is apply F
>>>> repeatedly in either direction to a topology to factor out common in
>>>> branches so that it can apply E to merge repartition nodes. This was
>>>> especially necessary before KIP-221 because you couldn't directly
>>>> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
>>>> so there was no way to do the factoring manually.
>>>>
>>>> We can now state very clearly that in KIP-221, explicit
>>>> `repartition()` operators should create a "reordering barrier". So, F
>>>> cannot be applied to an explicit `repartition()`. Also, I think we
>>>> decided earlier that explicit `repartition()` operations would also be
>>>> ineligible for merging, so E can't be applied to explicit
>>>> `repartition()` operations either. I think we feel we _could_ apply E
>>>> without harm, but we want to be conservative for now.
>>>>
>>>> I think the salient point from the latter discussion has been that
>>>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
>>>> an explicit `repartition()` operator, and therefore, the resulting
>>>> repartition node remains eligible for optimization.
>>>>
>>>> To be clear, I agree with Matthias that the provided partition count
>>>> _must_ be used in the resulting implicit repartition. This has some
>>>> implications for E. Namely, E could only be applied to two repartition
>>>> nodes that have the same partition count. This has always been
>>>> trivially true before KIP-221 because the partition count has always
>>>> been "unspecified", i.e., it would be determined at runtime by the
>>>> user-managed-topics' partition counts. Now, it could be specified or
>>>> unspecified. We can simply augment E to allow merging only repartition
>>>> nodes where the partition count is EITHER "specified and the same on
>>>> both sides", OR "unspecified on both sides".
>>>>
>>>> Sorry for the long email, but I have a hope that it builds a solid
>>>> theoretical foundation for our decisions in KIP-221, so we can have
>>>> confidence that there are no edge cases for design flaws to hide.
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <ma...@confluent.io> wrote:
>>>>>
>>>>>> it seems like we do want to allow
>>>>>>> people to optionally specify a partition count as part of this
>>>>>>> operation, but we don't want that option to _force_ repartitioning
>>>>>
>>>>> Correct, ie, that is my suggestions.
>>>>>
>>>>>> "Use P partitions if repartitioning is necessary"
>>>>>
>>>>> I disagree here, because my reasoning is that:
>>>>>
>>>>> - if a user cares about the number of partition, the user wants to
>>>>> enforce a repartitioning
>>>>> - if a user does not case about the number of partitions, we don't need
>>>>> to provide them a way to pass in a "hint"
>>>>>
>>>>> Hence, it should be sufficient to support:
>>>>>
>>>>> // user does not care
>>>>>
>>>>>  `stream.groupByKey(Grouped)`
>>>>>  `stream.grouBy(..., Grouped)`
>>>>>
>>>>> // user does care
>>>>>
>>>>>  `stream.repartition(Repartitioned).groupByKey()`
>>>>>  `streams.groupBy(..., Repartitioned)`
>>>>>
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>>>> Thanks for those thoughts, Matthias,
>>>>>>
>>>>>> I find your reasoning about the optimization behavior compelling. The
>>>>>> `through` operation is very simple and clear to reason about. It just
>>>>>> passes the data exactly at the specified point in the topology exactly
>>>>>> through the specified topic. Likewise, if a user invokes a
>>>>>> `repartition` operator, the simplest behavior is if we just do what
>>>>>> they asked for.
>>>>>>
>>>>>> Stepping back to think about when optimizations are surprising and
>>>>>> when they aren't, it occurs to me that we should be free to move
>>>>>> around repartitions when users have asked to perform some operation
>>>>>> that implies a repartition, like "change keys, then filter, then
>>>>>> aggregate". This program requires a repartition, but it could be
>>>>>> anywhere between the key change and the aggregation. On the other
>>>>>> hand, if they say, "change keys, then filter, then repartition, then
>>>>>> aggregate", it seems like they were pretty clear about their desire,
>>>>>> and we should just take it at face value.
>>>>>>
>>>>>> So, I'm sold on just literally doing a repartition every time they
>>>>>> invoke the `repartition` operator.
>>>>>>
>>>>>>
>>>>>> The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
>>>>>>
>>>>>> What you said about `groupByKey` makes sense. If they key hasn't
>>>>>> actually changed, then we don't need to repartition before
>>>>>> aggregating. On the other hand, `groupBy` is specifically changing the
>>>>>> key as part of the grouping operation, so (as you said) we definitely
>>>>>> have to do a repartition.
>>>>>>
>>>>>> If I'm reading the discussion right, it seems like we do want to allow
>>>>>> people to optionally specify a partition count as part of this
>>>>>> operation, but we don't want that option to _force_ repartitioning if
>>>>>> it's not needed. That last clause is the key. "Use P partitions if
>>>>>> repartitioning is necessary" is a directive that applies cleanly and
>>>>>> correctly to both `groupBy` and `groupByKey`. What if we call the
>>>>>> option `numberOfPartitionsHint`, which along with the "if necessary"
>>>>>> javadoc, should make it clear that the option won't force a
>>>>>> repartition, and also gives us enough latitude to still employ the
>>>>>> optimizer on those repartition topics?
>>>>>>
>>>>>> If we like the idea of expressing it as a "hint" for grouping and a
>>>>>> "command" for `repartition`, then it seems like it still makes sense
>>>>>> to keep Grouped and Repartitioned separate, as they would actually
>>>>>> offer different methods with distinct semantics.
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> Thanks,
>>>>>> -John
>>>>>>
>>>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
>>>>>>>
>>>>>>> Sorry for late reply.
>>>>>>>
>>>>>>> I guess, the question boils down to the intended semantics of
>>>>>>> `repartition()`. My understanding is as follows:
>>>>>>>
>>>>>>> - KS does auto-repartitioning for correctness reasons (using the
>>>>>>> upstream topic to determine the number of partitions)
>>>>>>> - KS does auto-repartitioning only for downstream DSL operators like
>>>>>>> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
>>>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>>>> - KS offers `through()` to enforce a repartitioning -- however, the user
>>>>>>> needs to create the topic manually (with the desired number of partitions).
>>>>>>>
>>>>>>> I see two main applications for `repartitioning()`:
>>>>>>>
>>>>>>> 1) repartition data before a `transform()` but user does not want to
>>>>>>> manage the topic
>>>>>>> 2) scale out a downstream subtopology
>>>>>>>
>>>>>>> Hence, I see `repartition()` similar to `through()`: if a users calls
>>>>>>> it, a repartitining is enforced, with the difference that KS manages the
>>>>>>> topic and the user does not need to create it.
>>>>>>>
>>>>>>> This behavior makes (1) and (2) possible.
>>>>>>>
>>>>>>>> I think many users would prefer to just say "if there *is* a repartition
>>>>>>>> required at this point in the topology, it should
>>>>>>>> have N partitions"
>>>>>>>
>>>>>>> Because of (2), I disagree. Either a user does not care about scaling
>>>>>>> out, for which case she would not specify the number of partitions. Or a
>>>>>>> user does care, and hence wants to enforce the scale out. I don't think
>>>>>>> that any user would say, "maybe scale out".
>>>>>>>
>>>>>>> Therefore, the optimizer should never ignore the repartition operation.
>>>>>>> As a "consequence" (because repartitioning is expensive) a user should
>>>>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
>>>>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>>>>>>> because it might be too subtle and an "optional scaling out" as laid out
>>>>>>> above does not make sense IMHO.
>>>>>>>
>>>>>>> I am also not worried about "over repartitioning" because the result
>>>>>>> stream would never trigger auto-repartitioning. Only if multiple
>>>>>>> consecutive calls to `repartition()` are made it could be bad -- but
>>>>>>> that's the same with `through()`. In the end, there is always some
>>>>>>> responsibility on the user.
>>>>>>>
>>>>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
>>>>>>> however, for `groupByKey()` it depends if the KStream is marked as
>>>>>>> `repartitioningRequired`.
>>>>>>>
>>>>>>> Hence, for `groupByKey()` it should not be possible for a user to set
>>>>>>> number of partitions IMHO. For `groupBy()` it's a different story,
>>>>>>> because calling
>>>>>>>
>>>>>>>   `repartition().groupBy()`
>>>>>>>
>>>>>>> does not achieve what we want. Hence, allowing users to pass in the
>>>>>>> number of users partitions into `groupBy()` does actually makes sense,
>>>>>>> because repartitioning will happen anyway and thus we can piggyback a
>>>>>>> scaling decision.
>>>>>>>
>>>>>>> I think that John has a fair concern about the overloads, however, I am
>>>>>>> not convinced that using `Grouped` to specify the number of partitions
>>>>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and both
>>>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>>>>>>> wondering if we could bridge the gap between both, if we would make
>>>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>>>> `groupBy(Grouped)` and a user can pass in both types what seems to make
>>>>>>> the API quite smooth:
>>>>>>>
>>>>>>>  `stream.groupBy(..., Grouped...)`
>>>>>>>
>>>>>>>  `stream.groupBy(..., Repartitioned...)`
>>>>>>>
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>>>> Hi Sophie,
>>>>>>>>
>>>>>>>> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Levani
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>>>>>> Kafka Streams has already
>>>>>>>>>> optimizer in place which alters topology independently from user
>>>>>>>>>
>>>>>>>>> I agree (with you) and think this is a good way to put it -- we currently
>>>>>>>>> auto-repartition for the user so
>>>>>>>>> that they don't have to walk through their entire topology and reason about
>>>>>>>>> when and where to place a
>>>>>>>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>>>>>>>>> user? How certain are we that
>>>>>>>>> users will always get this right? It's easy to imagine that during
>>>>>>>>> development, you write your new app with
>>>>>>>>> correctly placed repartitions in order to use this new feature. During the
>>>>>>>>> course of development you end up
>>>>>>>>> tweaking the topology, but don't remember to review or move the
>>>>>>>>> repartitioning since you're used to Streams
>>>>>>>>> doing this for you. If you use only single-partition topics for testing,
>>>>>>>>> you might not even notice your app is
>>>>>>>>> spitting out incorrect results!
>>>>>>>>>
>>>>>>>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>>>>>>>>> feature and say that to use it, you can't take
>>>>>>>>> advantage of this other feature anymore. Also, is it possible our
>>>>>>>>> optimization framework could ever include an
>>>>>>>>> optimized repartitioning strategy that is better than what a user could
>>>>>>>>> achieve by manually inserting repartitions?
>>>>>>>>> Do we expect users to have a deep understanding of the best way to
>>>>>>>>> repartition their particular topology, or is it
>>>>>>>>> likely they will end up over-repartitioning either due to missed
>>>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>>>> I think many users would prefer to just say "if there *is* a repartition
>>>>>>>>> required at this point in the topology, it should
>>>>>>>>> have N partitions"
>>>>>>>>>
>>>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>>>>>>>>> adding a new parameter to groupBy, that does seem more in line with the
>>>>>>>>> current syntax so +1 from me
>>>>>>>>>
>>>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>>>>>>>>> almost done, I want to resurrect discussion about this KIP to address
>>>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>>>
>>>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>>>> 1) Introducing new KStream#repartition operation which is done in
>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>>>>>>>>> be separate PR.
>>>>>>>>>>
>>>>>>>>>> While all agree about general implementation and idea behind
>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some questions during
>>>>>>>>>> the review.
>>>>>>>>>> Matthias raised concern that there can be cases when user uses
>>>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>>>>>>>>> not required, thus configuration passed via `Repartitioned` would never be
>>>>>>>>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>>>>>>>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>>>>>>>>> she should always use `KStream#repartition` operation before groupBy. Full
>>>>>>>>>> comment can be seen here:
>>>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>>>
>>>>>>>>>> On the same topic, John pointed out that, from API design perspective, we
>>>>>>>>>> shouldn’t intertwine configuration classes of different operators between
>>>>>>>>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>>>>>>>>>> for specifying number of partitions for internal topic, we should update
>>>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>>>
>>>>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>>>>>>> Kafka Streams has already optimizer in place which alters topology
>>>>>>>>>> independently from user. So maybe it makes sense if Kafka Streams,
>>>>>>>>>> internally would optimize topology in the best way possible, even if in
>>>>>>>>>> some cases this means ignoring some operator configurations passed by the
>>>>>>>>>> user. Also, I agree with John about API design semantics. If we go through
>>>>>>>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>>>
>>>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>
>>>>>>>>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>>>>>>>>> much
>>>>>>>>>>> spare time at the
>>>>>>>>>>> moment. It's kind of a difficult time to get attention on things, but
>>>>>>>>>> feel
>>>>>>>>>>> free to pick up something else
>>>>>>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Sophie
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>>>>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>>>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
>>>>>>>>>> <ma...@confluent.io>
>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>>>>>>>>> be
>>>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>>>>>>>>> possible.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>>>>>>>>> helps
>>>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>>>>>>> abbreviations.
>>>>>>>>>>>> Can
>>>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>>>>>>>>> have
>>>>>>>>>>>> some duplicate configurations between operators in order to keep API
>>>>>>>>>> simple
>>>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>>>>>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
>>>>>>>>>> <ma...@gmail.com>
>>>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>>>>>>>>> rationale
>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
>>>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>>>>>>>>> john@confluent.io>
>>>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>>>>>>>>> not
>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>>>>>>>>> is
>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>>>>>>>>> --
>>>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>>>>>>>>> for
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> <
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>>>>>>>>> about
>>>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>>>>>>>>> as
>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>>>>>>>>> to
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>>>>>>>>> operators
>>>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>>>>>>>>> indicate
>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>>>>>>>>> configs,
>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>>>>>>>>> enhancing
>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>>>>>>> assumption
>>>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>>>>>>>>> topics
>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>>>>>>> Considering
>>>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>>>>>>> will
>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>>>>>>> Grouped
>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>>>>>>> will
>>>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>>>>>>> Grouped
>>>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>>
>>


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by John Roesler <jo...@confluent.io>.
Thanks, all,

I can get behind just totally leaving out reparation-via-groupBy. If
we only introduce `repartition()` for now, we're making the minimal
change to gain the desired capability.

Plus, since we agree that `repartition()` should never be optimizable,
it's a future-compatible proposal. I.e., if we were to add a
non-optimizable groupBy(partitions) operation now, and want to make it
optimizable in the future, we have to worry about topology
compatibility. Better to just do non-optimizable `repartition()` now,
and add an optimizable `groupBy(partitions)` in the future (maybe).

About joins, yes, it's a concern, and IMO we should just do the same
thing we do now... check at runtime that the partition counts on both
sides match and throw an exception otherwise. What this means as a
user is that if you explicitly repartition the left side to 100
partitions, and then join with the right side at 10 partitions, you
get an exception, since this operation is not possible. You'd either
have to "step down" the left side again, back to 10 partitions, or you
could repartition the right side to 100 partitions before the join.
The choice has to be the user's, since it depends on their desired
execution parallelism.

Thanks,
-John

On Thu, Nov 14, 2019 at 12:55 AM Matthias J. Sax <ma...@confluent.io> wrote:
>
> Thanks a lot John. I think the way you decompose the operators is super
> helpful for this discussion.
>
> What you suggest with regard to using `Grouped` and enforcing
> repartitioning if the number of partitions is specified is certainly
> possible. However, I am not sure if we _should_ do this. My reasoning is
> that an enforce repartitioning as introduced via `repartition()` is an
> expensive operations, and it seems better to demand an more explicit
> user opt-in to trigger it. Just setting an optional parameter might be
> too subtle to trigger such a heavy "side effect".
>
> While I agree about "usability" in general, I would prefer a more
> conservative appraoch to introduce this feature, see how it goes, and
> maybe make it more advance later on. This also applies to what
> optimzation we may or may not allow (or are able to perform at all).
>
> @Levani: Reflecting about my suggestion about `Repartioned extends
> Grouped`, I agree that it might not be a good idea.
>
> Atm, I see an enforces repartitioning as non-optimizable and as a good
> first step and I would suggest to not intoruce anything else for now.
> Introducing optimizable enforce repartitioning via `groupBy(...,
> Grouped)` is something we could add later.
>
>
> Therefore, I would not change `Grouped` but only introduce
> `repartition()`. Users that use `grouBy()` atm, and want to opt-in to
> set the number of partitions, would need to rewrite their code to
> `selectKey(...).repartition(...).groupByKey()`. It's less convinient but
> also less risky from an API and optimization point of view.
>
>
> @Levani: about joins -> yes, we will need to check the specified number
> of partitions (if any) and if they don't match, throw an exception. We
> can discuss this on the PR -- I am just trying to get the PR for KIP-466
> merged -- your is next on the list :)
>
>
> Thoughts?
>
>
> -Matthias
>
>
> On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
> > Thank you all for an interesting discussion. This is very enlightening.
> >
> > Thank you Matthias for your explanation. Your arguments are very true. It makes sense that if user specifies number of partitions he/she really cares that those specifications are applied to internal topics.
> > Unfortunately, in current implementation this is not true during `join` operation. As I’ve written in the PR comment, currently, when `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of partitions from the two source topics.
> > I’m not really sure what would be the other way around this situation. Maybe fail the stream altogether and inform the user to specify same number of partitions?
> > Or we should treat join operations in a same way as it is right now and basically choose max number of partitions even when `repartition` operation is specified, because Kafka Streams “knows the best” how to handle joins?
> > You can check integration tests how it’s being handled currently. Open to suggestions on that part.
> >
> > As for groupBy, I agree and John raised very interesting points. My arguments for allowing users to specify number of partitions during groupBy operations mainly was coming from the usability perspective.
> > So building on top of what John said, maybe it makes sense to make `groupBy` operations smarter and whenever user specifies `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
> > I’m not going into optimization part yet :) I think it will be part of separate PR and task, but overall it makes sense to apply optimizations where number of partitions are the same.
> >
> > As for Repartitioned extending Grouped, I kinda feel that it won’t fit nicely in current API design.
> > In addition, in the PR review, John mentioned that there were a lot of troubles in the past trying to use one operation's configuration objects on other operations.
> > Also it makes sense to keep them separate in terms of compatibility.
> > In that case, we don’t have to worry every time Grouped is changed, what would be the implications on `repartition` operations.
> >
> > Kind regards,
> > Levani
> >
> >
> >> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
> >>
> >> Ah, thanks for the clarification. I missed your point.
> >>
> >> I like the framework you've presented. It does seem simpler to assume
> >> that they either care about the partition count and want to
> >> repartition to realize it, or they don't care about the number.
> >> Returning to this discussion, it does seem unlikely that they care
> >> about the number and _don't_ care if it actually gets realized.
> >>
> >> But then, it still seems like we can just keep the option as part of
> >> Grouped. As in:
> >>
> >> // user does not care
> >> stream.groupByKey(Grouped /*not specifying partition count*/)
> >> stream.groupBy(Grouped /*not specifying partition count*/)
> >>
> >> // user does care
> >> stream.repartition(Repartitioned)
> >> stream.groupByKey(Grouped.numberOfPartitions(...))
> >> stream.groupBy(Grouped.numberOfPartitions(...))
> >>
> >> ----
> >>
> >> The above discussion got me thinking about algebra. Matthias is
> >> absolutely right that `groupByKey(numPartitions)` is equivalent to
> >> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
> >> we should force people to apply that expansion themselves vs. having a
> >> more compact way to express it if they don't care where exactly the
> >> repartition occurs. However, thinking about these operators
> >> algebraically can really help *us* narrow down the number of different
> >> expressions we have to consider.
> >>
> >> Let's consider some identities:
> >>
> >> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
> >> B: src + ... + groupByKey + agg = src + ... + passthough + agg
> >> C: mapKey(mapper) + ... + groupByKey + agg
> >> = mapKey(mapper) + ... + repartition + groupByKey + agg
> >> D: repartition = sink(managed) + src
> >>
> >> In these identities, I used one special identifier (...), which means
> >> any number (0+) of operations that are not src, mapKey, groupBy[Key],
> >> repartition, or agg.
> >>
> >> For mental clarity, I'm just going to make up a rule that groupBy
> >> operations are not executable. In other words, you have to get to a
> >> point where you can apply B to convert a groupByKey into a passthough
> >> in order to execute the program. This is just a formal way of stating
> >> what already happens in Kafka Streams.
> >>
> >> By applying A, we can just completely leave `groupBy` out of our
> >> analysis. It trivially decomposes into a mapKey followed by a
> >> groupByKey.
> >>
> >> Then, we can eliminate the "repartition required" case of `groupByKey`
> >> by applying C followed by D to get to the "no repartition required"
> >> version of groupByKey, which in turn sets us up to apply B to get an
> >> executable topology.
> >>
> >> Fundamentally, you can think about KIP-221 is as proposing a modified
> >> D identity in which you can specify the partition count of the managed
> >> sink topic:
> >> D': repartition(pc) = sink(managed w/ pc) + src
> >>
> >> Since users _could_ apply the identities above, we don't actually have
> >> to add any partition count to groupBy[Key], but we decided early on in
> >> the KIP discussion that it's more ergonomic to add it. In that case,
> >> we also have to modify A and C:
> >> A': groupBy(mapper, pc) + agg
> >> = mapKey(mapper) + groupByKey(pc) + agg
> >> C': mapKey(mapper) + ... + groupByKey(pc) + agg
> >> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
> >>
> >> Which sets us up still to always be able to get back to a plain
> >> `groupByKey` operation (with no `(pc)`) and then apply D' and
> >> ultimately B to get an executable topology.
> >>
> >> What about the optimizer?
> >> The optimizer applies another set of graph-algebraic identities to
> >> minimize the number of repartition topics in a topology.
> >>
> >> (forgive my ascii art)
> >>
> >> E: (merging repartition nodes)
> >> (...) -> repartition -> X
> >>  \-> repartition -> Y
> >> =
> >> (... + repartition) -> X
> >>     \-> Y
> >> F: (reordering around repartition)
> >> Where SVO is any non-key-changing, stateless, operation:
> >> repartition -> SVO = SVO -> repartition
> >>
> >> In terms of these identities, what the optimizer does is apply F
> >> repeatedly in either direction to a topology to factor out common in
> >> branches so that it can apply E to merge repartition nodes. This was
> >> especially necessary before KIP-221 because you couldn't directly
> >> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
> >> so there was no way to do the factoring manually.
> >>
> >> We can now state very clearly that in KIP-221, explicit
> >> `repartition()` operators should create a "reordering barrier". So, F
> >> cannot be applied to an explicit `repartition()`. Also, I think we
> >> decided earlier that explicit `repartition()` operations would also be
> >> ineligible for merging, so E can't be applied to explicit
> >> `repartition()` operations either. I think we feel we _could_ apply E
> >> without harm, but we want to be conservative for now.
> >>
> >> I think the salient point from the latter discussion has been that
> >> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
> >> an explicit `repartition()` operator, and therefore, the resulting
> >> repartition node remains eligible for optimization.
> >>
> >> To be clear, I agree with Matthias that the provided partition count
> >> _must_ be used in the resulting implicit repartition. This has some
> >> implications for E. Namely, E could only be applied to two repartition
> >> nodes that have the same partition count. This has always been
> >> trivially true before KIP-221 because the partition count has always
> >> been "unspecified", i.e., it would be determined at runtime by the
> >> user-managed-topics' partition counts. Now, it could be specified or
> >> unspecified. We can simply augment E to allow merging only repartition
> >> nodes where the partition count is EITHER "specified and the same on
> >> both sides", OR "unspecified on both sides".
> >>
> >> Sorry for the long email, but I have a hope that it builds a solid
> >> theoretical foundation for our decisions in KIP-221, so we can have
> >> confidence that there are no edge cases for design flaws to hide.
> >>
> >> Thanks,
> >> -John
> >>
> >> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <ma...@confluent.io> wrote:
> >>>
> >>>> it seems like we do want to allow
> >>>>> people to optionally specify a partition count as part of this
> >>>>> operation, but we don't want that option to _force_ repartitioning
> >>>
> >>> Correct, ie, that is my suggestions.
> >>>
> >>>> "Use P partitions if repartitioning is necessary"
> >>>
> >>> I disagree here, because my reasoning is that:
> >>>
> >>> - if a user cares about the number of partition, the user wants to
> >>> enforce a repartitioning
> >>> - if a user does not case about the number of partitions, we don't need
> >>> to provide them a way to pass in a "hint"
> >>>
> >>> Hence, it should be sufficient to support:
> >>>
> >>> // user does not care
> >>>
> >>>  `stream.groupByKey(Grouped)`
> >>>  `stream.grouBy(..., Grouped)`
> >>>
> >>> // user does care
> >>>
> >>>  `stream.repartition(Repartitioned).groupByKey()`
> >>>  `streams.groupBy(..., Repartitioned)`
> >>>
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 11/9/19 8:10 PM, John Roesler wrote:
> >>>> Thanks for those thoughts, Matthias,
> >>>>
> >>>> I find your reasoning about the optimization behavior compelling. The
> >>>> `through` operation is very simple and clear to reason about. It just
> >>>> passes the data exactly at the specified point in the topology exactly
> >>>> through the specified topic. Likewise, if a user invokes a
> >>>> `repartition` operator, the simplest behavior is if we just do what
> >>>> they asked for.
> >>>>
> >>>> Stepping back to think about when optimizations are surprising and
> >>>> when they aren't, it occurs to me that we should be free to move
> >>>> around repartitions when users have asked to perform some operation
> >>>> that implies a repartition, like "change keys, then filter, then
> >>>> aggregate". This program requires a repartition, but it could be
> >>>> anywhere between the key change and the aggregation. On the other
> >>>> hand, if they say, "change keys, then filter, then repartition, then
> >>>> aggregate", it seems like they were pretty clear about their desire,
> >>>> and we should just take it at face value.
> >>>>
> >>>> So, I'm sold on just literally doing a repartition every time they
> >>>> invoke the `repartition` operator.
> >>>>
> >>>>
> >>>> The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
> >>>>
> >>>> What you said about `groupByKey` makes sense. If they key hasn't
> >>>> actually changed, then we don't need to repartition before
> >>>> aggregating. On the other hand, `groupBy` is specifically changing the
> >>>> key as part of the grouping operation, so (as you said) we definitely
> >>>> have to do a repartition.
> >>>>
> >>>> If I'm reading the discussion right, it seems like we do want to allow
> >>>> people to optionally specify a partition count as part of this
> >>>> operation, but we don't want that option to _force_ repartitioning if
> >>>> it's not needed. That last clause is the key. "Use P partitions if
> >>>> repartitioning is necessary" is a directive that applies cleanly and
> >>>> correctly to both `groupBy` and `groupByKey`. What if we call the
> >>>> option `numberOfPartitionsHint`, which along with the "if necessary"
> >>>> javadoc, should make it clear that the option won't force a
> >>>> repartition, and also gives us enough latitude to still employ the
> >>>> optimizer on those repartition topics?
> >>>>
> >>>> If we like the idea of expressing it as a "hint" for grouping and a
> >>>> "command" for `repartition`, then it seems like it still makes sense
> >>>> to keep Grouped and Repartitioned separate, as they would actually
> >>>> offer different methods with distinct semantics.
> >>>>
> >>>> WDYT?
> >>>>
> >>>> Thanks,
> >>>> -John
> >>>>
> >>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
> >>>>>
> >>>>> Sorry for late reply.
> >>>>>
> >>>>> I guess, the question boils down to the intended semantics of
> >>>>> `repartition()`. My understanding is as follows:
> >>>>>
> >>>>> - KS does auto-repartitioning for correctness reasons (using the
> >>>>> upstream topic to determine the number of partitions)
> >>>>> - KS does auto-repartitioning only for downstream DSL operators like
> >>>>> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
> >>>>> even if the stream is marked as `repartitioningRequired`).
> >>>>> - KS offers `through()` to enforce a repartitioning -- however, the user
> >>>>> needs to create the topic manually (with the desired number of partitions).
> >>>>>
> >>>>> I see two main applications for `repartitioning()`:
> >>>>>
> >>>>> 1) repartition data before a `transform()` but user does not want to
> >>>>> manage the topic
> >>>>> 2) scale out a downstream subtopology
> >>>>>
> >>>>> Hence, I see `repartition()` similar to `through()`: if a users calls
> >>>>> it, a repartitining is enforced, with the difference that KS manages the
> >>>>> topic and the user does not need to create it.
> >>>>>
> >>>>> This behavior makes (1) and (2) possible.
> >>>>>
> >>>>>> I think many users would prefer to just say "if there *is* a repartition
> >>>>>> required at this point in the topology, it should
> >>>>>> have N partitions"
> >>>>>
> >>>>> Because of (2), I disagree. Either a user does not care about scaling
> >>>>> out, for which case she would not specify the number of partitions. Or a
> >>>>> user does care, and hence wants to enforce the scale out. I don't think
> >>>>> that any user would say, "maybe scale out".
> >>>>>
> >>>>> Therefore, the optimizer should never ignore the repartition operation.
> >>>>> As a "consequence" (because repartitioning is expensive) a user should
> >>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
> >>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
> >>>>> because it might be too subtle and an "optional scaling out" as laid out
> >>>>> above does not make sense IMHO.
> >>>>>
> >>>>> I am also not worried about "over repartitioning" because the result
> >>>>> stream would never trigger auto-repartitioning. Only if multiple
> >>>>> consecutive calls to `repartition()` are made it could be bad -- but
> >>>>> that's the same with `through()`. In the end, there is always some
> >>>>> responsibility on the user.
> >>>>>
> >>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
> >>>>> however, for `groupByKey()` it depends if the KStream is marked as
> >>>>> `repartitioningRequired`.
> >>>>>
> >>>>> Hence, for `groupByKey()` it should not be possible for a user to set
> >>>>> number of partitions IMHO. For `groupBy()` it's a different story,
> >>>>> because calling
> >>>>>
> >>>>>   `repartition().groupBy()`
> >>>>>
> >>>>> does not achieve what we want. Hence, allowing users to pass in the
> >>>>> number of users partitions into `groupBy()` does actually makes sense,
> >>>>> because repartitioning will happen anyway and thus we can piggyback a
> >>>>> scaling decision.
> >>>>>
> >>>>> I think that John has a fair concern about the overloads, however, I am
> >>>>> not convinced that using `Grouped` to specify the number of partitions
> >>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and both
> >>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
> >>>>> wondering if we could bridge the gap between both, if we would make
> >>>>> `Repartitioned extends Grouped`? For this case, we only need
> >>>>> `groupBy(Grouped)` and a user can pass in both types what seems to make
> >>>>> the API quite smooth:
> >>>>>
> >>>>>  `stream.groupBy(..., Grouped...)`
> >>>>>
> >>>>>  `stream.groupBy(..., Repartitioned...)`
> >>>>>
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> >>>>>> Hi Sophie,
> >>>>>>
> >>>>>> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Levani
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
> >>>>>>>
> >>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
> >>>>>>> Kafka Streams has already
> >>>>>>>> optimizer in place which alters topology independently from user
> >>>>>>>
> >>>>>>> I agree (with you) and think this is a good way to put it -- we currently
> >>>>>>> auto-repartition for the user so
> >>>>>>> that they don't have to walk through their entire topology and reason about
> >>>>>>> when and where to place a
> >>>>>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
> >>>>>>> user? How certain are we that
> >>>>>>> users will always get this right? It's easy to imagine that during
> >>>>>>> development, you write your new app with
> >>>>>>> correctly placed repartitions in order to use this new feature. During the
> >>>>>>> course of development you end up
> >>>>>>> tweaking the topology, but don't remember to review or move the
> >>>>>>> repartitioning since you're used to Streams
> >>>>>>> doing this for you. If you use only single-partition topics for testing,
> >>>>>>> you might not even notice your app is
> >>>>>>> spitting out incorrect results!
> >>>>>>>
> >>>>>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
> >>>>>>> feature and say that to use it, you can't take
> >>>>>>> advantage of this other feature anymore. Also, is it possible our
> >>>>>>> optimization framework could ever include an
> >>>>>>> optimized repartitioning strategy that is better than what a user could
> >>>>>>> achieve by manually inserting repartitions?
> >>>>>>> Do we expect users to have a deep understanding of the best way to
> >>>>>>> repartition their particular topology, or is it
> >>>>>>> likely they will end up over-repartitioning either due to missed
> >>>>>>> optimizations or unnecessary extra repartitions?
> >>>>>>> I think many users would prefer to just say "if there *is* a repartition
> >>>>>>> required at this point in the topology, it should
> >>>>>>> have N partitions"
> >>>>>>>
> >>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
> >>>>>>> adding a new parameter to groupBy, that does seem more in line with the
> >>>>>>> current syntax so +1 from me
> >>>>>>>
> >>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>>
> >>>>>>>> While https://github.com/apache/kafka/pull/7170 <
> >>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
> >>>>>>>> almost done, I want to resurrect discussion about this KIP to address
> >>>>>>>> couple of concerns raised by Matthias and John.
> >>>>>>>>
> >>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
> >>>>>>>> repartitioning and parallelism of sub-topologies by:
> >>>>>>>> 1) Introducing new KStream#repartition operation which is done in
> >>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>> https://github.com/apache/kafka/pull/7170>
> >>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
> >>>>>>>> be separate PR.
> >>>>>>>>
> >>>>>>>> While all agree about general implementation and idea behind
> >>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
> >>>>>>>> KStream#groupBy(Repartitioned) method overload raised some questions during
> >>>>>>>> the review.
> >>>>>>>> Matthias raised concern that there can be cases when user uses
> >>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
> >>>>>>>> not required, thus configuration passed via `Repartitioned` would never be
> >>>>>>>> applied (Matthias, please correct me if I misinterpreted your comment).
> >>>>>>>> So instead, if user wants to control parallelism of sub-topologies, he or
> >>>>>>>> she should always use `KStream#repartition` operation before groupBy. Full
> >>>>>>>> comment can be seen here:
> >>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> >>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> >>>>>>>>
> >>>>>>>> On the same topic, John pointed out that, from API design perspective, we
> >>>>>>>> shouldn’t intertwine configuration classes of different operators between
> >>>>>>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
> >>>>>>>> for specifying number of partitions for internal topic, we should update
> >>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
> >>>>>>>>
> >>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
> >>>>>>>> Kafka Streams has already optimizer in place which alters topology
> >>>>>>>> independently from user. So maybe it makes sense if Kafka Streams,
> >>>>>>>> internally would optimize topology in the best way possible, even if in
> >>>>>>>> some cases this means ignoring some operator configurations passed by the
> >>>>>>>> user. Also, I agree with John about API design semantics. If we go through
> >>>>>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
> >>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
> >>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
> >>>>>>>>
> >>>>>>>> I would really appreciate communities feedback on this.
> >>>>>>>>
> >>>>>>>> Kind regards,
> >>>>>>>> Levani
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hey Levani,
> >>>>>>>>>
> >>>>>>>>> I think people are busy with the upcoming 2.4 release, and don't have
> >>>>>>>> much
> >>>>>>>>> spare time at the
> >>>>>>>>> moment. It's kind of a difficult time to get attention on things, but
> >>>>>>>> feel
> >>>>>>>>> free to pick up something else
> >>>>>>>>> to work on in the meantime until things have calmed down a bit!
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Sophie
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello all,
> >>>>>>>>>>
> >>>>>>>>>> Sorry for bringing this thread again, but I would like to get some
> >>>>>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
> >>>>>>>> https://github.com/apache/kafka/pull/7170> <
> >>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>>>>> https://github.com/apache/kafka/pull/7170>>
> >>>>>>>>>> It's been a while now and I would love to move on to other KIPs as well.
> >>>>>>>>>> Please let me know if you have any concerns.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Levani
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> Here’s voting thread for this KIP:
> >>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
> >>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
> >>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Updated KIP
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Levani
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
> >>>>>>>> <ma...@confluent.io>
> >>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One question:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Why do we add
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
> >>>>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
> >>>>>>>> be
> >>>>>>>>>>>>> sufficient. Users can chain the method calls.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> (I think it's valuable to keep the number of overload small if
> >>>>>>>>>> possible.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Otherwise LGTM.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks all for your feedback.
> >>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
> >>>>>>>>>> concerns about this KIP, please let me know.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
> >>>>>>>>>>>>>>> I’ve updated KIP (
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>>>> ).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> >>>>>>>> matthias@confluent.io <ma...@confluent.io>
> >>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for driving the KIP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
> >>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
> >>>>>>>>>> configs. My
> >>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
> >>>>>>>>>> `Produced`,
> >>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
> >>>>>>>>>> helps
> >>>>>>>>>>>>>>>> to keep the API "clean".
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
> >>>>>>>> abbreviations.
> >>>>>>>>>> Can
> >>>>>>>>>>>>>>>> we rename:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
> >>>>>>>> have
> >>>>>>>>>> some duplicate configurations between operators in order to keep API
> >>>>>>>> simple
> >>>>>>>>>> and usable.
> >>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
> >>>>>>>>>> backward compatibility.
> >>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
> >>>>>>>>>> exciting, having Repartitioned class encapsulate some producer level
> >>>>>>>>>> configs makes API more readable.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
> >>>>>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
> >>>>>>>>>> separate.
> >>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
> >>>>>>>>>> configuration. But
> >>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
> >>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
> >>>>>>>>>> configured
> >>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
> >>>>>>>>>> writing to the
> >>>>>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
> >>>>>>>>>> sense, these
> >>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
> >>>>>>>>>> true to
> >>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
> >>>>>>>>>> wrapped into a
> >>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
> >>>>>>>>>> split out the
> >>>>>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
> >>>>>>>>>> topic and
> >>>>>>>>>>>>>>>>>> producer level configurations separate, but this increases the
> >>>>>>>>>> API surface
> >>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
> >>>>>>>> is
> >>>>>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
> >>>>>>>>>> about what
> >>>>>>>>>>>>>>>>>> exactly you're specifying.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
> >>>>>>>>>> Repartitioned,
> >>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
> >>>>>>>>>> Streams at
> >>>>>>>>>>>>>>>>>> the moment
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> >>>>>>>>>> `Repartitioned#partitioner`
> >>>>>>>>>>>>>>>>>>> configuration.
> >>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
> >>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
> >>>>>>>>>> configurations (such
> >>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
> >>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
> >>>>>>>>>> couldn’t really
> >>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
> >>>>>>>>>> class and just
> >>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
> >>>>>>>>>> semantics.
> >>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
> >>>>>>>>>> thing to do.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
> >>>>>>>> <ma...@gmail.com>
> >>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
> >>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Lavani,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
> >>>>>>>>>> the reason
> >>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
> >>>>>>>>>> partitioning
> >>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
> >>>>>>>>>> key
> >>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
> >>>>>>>>>> the key
> >>>>>>>>>>>>>>>>>>>> 2. Something other than the key
> >>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
> >>>>>>>> in
> >>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
> >>>>>>>>>> consistency.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
> >>>>>>>>>> with the
> >>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
> >>>>>>>>>> topics, user
> >>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
> >>>>>>>>>> operation.
> >>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> >>>>>>>> john@confluent.io <ma...@confluent.io>
> >>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
> >>>>>>>>>> sure it
> >>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
> >>>>>>>> rationale
> >>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
> >>>>>>>>>> first issue
> >>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
> >>>>>>>>>> essentially
> >>>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
> >>>>>>>>>>>>>>>>>>> KStream#join
> >>>>>>>>>>>>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> >>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
> >>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
> >>>>>>>> john@confluent.io>
> >>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
> >>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
> >>>>>>>>>> get
> >>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
> >>>>>>>>>> right-side
> >>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
> >>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
> >>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
> >>>>>>>>>> important?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
> >>>>>>>> not
> >>>>>>>>>> sure
> >>>>>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
> >>>>>>>>>> need control
> >>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
> >>>>>>>>>> how to
> >>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
> >>>>>>>>>> topics
> >>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
> >>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
> >>>>>>>>>> specifying the
> >>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
> >>>>>>>> this.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
> >>>>>>>>>> would be a
> >>>>>>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
> >>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
> >>>>>>>>>> Produced? I'm not
> >>>>>>>>>>>>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
> >>>>>>>> be
> >>>>>>>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
> >>>>>>>>>> the topic
> >>>>>>>>>>>>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
> >>>>>>>> is
> >>>>>>>>>> missing
> >>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
> >>>>>>>>>> configuring
> >>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
> >>>>>>>> --
> >>>>>>>>>> for the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
> >>>>>>>>>> will the
> >>>>>>>>>>>>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
> >>>>>>>> for
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
> >>>>>>>>>> pre-created? The
> >>>>>>>>>>>>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
> >>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>> <
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>
> >>>>>>>>>> <
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
> >>>>>>>> about
> >>>>>>>>>> this KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
> >>>>>>>>>> confluence
> >>>>>>>>>>>>>>>>>>> page
> >>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
> >>>>>>>>>> class for
> >>>>>>>>>>>>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
> >>>>>>>> `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
> >>>>>>>> as
> >>>>>>>>>> well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
> >>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
> >>>>>>>>>> summarizing
> >>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
> >>>>>>>>>> already (can't
> >>>>>>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
> >>>>>>>>>> refresh the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
> >>>>>>>> to
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
> >>>>>>>> operators
> >>>>>>>>>> that use
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
> >>>>>>>> indicate
> >>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
> >>>>>>>>>> use-defined
> >>>>>>>>>>>>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
> >>>>>>>>>> operator that
> >>>>>>>>>>>>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
> >>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
> >>>>>>>>>> sometimes
> >>>>>>>>>>>>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
> >>>>>>>>>> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
> >>>>>>>>>> wondering
> >>>>>>>>>>>>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
> >>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
> >>>>>>>>>> others on
> >>>>>>>>>>>>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
> >>>>>>>> to
> >>>>>>>>>>>>>>>>>>> configure
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
> >>>>>>>>>> assume that
> >>>>>>>>>>>>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
> >>>>>>>>>> instead of
> >>>>>>>>>>>>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
> >>>>>>>> configs,
> >>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
> >>>>>>>>>> need to
> >>>>>>>>>>>>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
> >>>>>>>>>> them
> >>>>>>>>>>>>>>>>>>> myself.
> >>>>>>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
> >>>>>>>> enhancing
> >>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
> >>>>>>>>>> Let me
> >>>>>>>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
> >>>>>>>>>> configuration,
> >>>>>>>>>>>>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
> >>>>>>>>>> the final
> >>>>>>>>>>>>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
> >>>>>>>> assumption
> >>>>>>>>>> that user
> >>>>>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
> >>>>>>>>>> case, having
> >>>>>>>>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
> >>>>>>>>>> contract, seems
> >>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
> >>>>>>>>>> explicitly for
> >>>>>>>>>>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
> >>>>>>>>>> are producer
> >>>>>>>>>>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
> >>>>>>>>>>>>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> >>>>>>>>>> configurations
> >>>>>>>>>>>>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
> >>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
> >>>>>>>> topics
> >>>>>>>>>> created
> >>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
> >>>>>>>>>> should leave
> >>>>>>>>>>>>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
> >>>>>>>>>> should
> >>>>>>>>>>>>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
> >>>>>>>>>> create and
> >>>>>>>>>>>>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
> >>>>>>>>>> should be
> >>>>>>>>>>>>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
> >>>>>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
> >>>>>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
> >>>>>>>>>> KStream#through - I
> >>>>>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
> >>>>>>>>>> possibility to
> >>>>>>>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
> >>>>>>>> Considering
> >>>>>>>>>> arguments
> >>>>>>>>>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
> >>>>>>>>>> options
> >>>>>>>>>>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> >>>>>>>> will
> >>>>>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >>>>>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >>>>>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
> >>>>>>>>>> new method
> >>>>>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
> >>>>>>>>>> in one of
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
> >>>>>>>>>> following
> >>>>>>>>>>>>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> >>>>>>>> Grouped
> >>>>>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> >>>>>>>>>> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> >>>>>>>> will
> >>>>>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >>>>>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >>>>>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> >>>>>>>> Grouped
> >>>>>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
> >>>>>>>>>> creating and
> >>>>>>>>>>>>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
> >>>>>>>>>> discussed in the
> >>>>>>>>>>>>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
> >>>>>>>>>> about this
> >>>>>>>>>>>>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>>>>>>> levani.codes@gmail.com>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
> >>>>>>>>>> KIP-221. Going
> >>>>>>>>>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
> >>>>>>>>>> around
> >>>>>>>>>>>>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
> >>>>>>>>>> understood, the
> >>>>>>>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
> >>>>>>>>>> method
> >>>>>>>>>>>>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
> >>>>>>>>>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
> >>>>>>>> control
> >>>>>>>>>>>>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
> >>>>>>>> operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
> >>>>>>>>>>>>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >
> >
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks a lot John. I think the way you decompose the operators is super
helpful for this discussion.

What you suggest with regard to using `Grouped` and enforcing
repartitioning if the number of partitions is specified is certainly
possible. However, I am not sure if we _should_ do this. My reasoning is
that an enforce repartitioning as introduced via `repartition()` is an
expensive operations, and it seems better to demand an more explicit
user opt-in to trigger it. Just setting an optional parameter might be
too subtle to trigger such a heavy "side effect".

While I agree about "usability" in general, I would prefer a more
conservative appraoch to introduce this feature, see how it goes, and
maybe make it more advance later on. This also applies to what
optimzation we may or may not allow (or are able to perform at all).

@Levani: Reflecting about my suggestion about `Repartioned extends
Grouped`, I agree that it might not be a good idea.

Atm, I see an enforces repartitioning as non-optimizable and as a good
first step and I would suggest to not intoruce anything else for now.
Introducing optimizable enforce repartitioning via `groupBy(...,
Grouped)` is something we could add later.


Therefore, I would not change `Grouped` but only introduce
`repartition()`. Users that use `grouBy()` atm, and want to opt-in to
set the number of partitions, would need to rewrite their code to
`selectKey(...).repartition(...).groupByKey()`. It's less convinient but
also less risky from an API and optimization point of view.


@Levani: about joins -> yes, we will need to check the specified number
of partitions (if any) and if they don't match, throw an exception. We
can discuss this on the PR -- I am just trying to get the PR for KIP-466
merged -- your is next on the list :)


Thoughts?


-Matthias


On 11/12/19 4:51 PM, Levani Kokhreidze wrote:
> Thank you all for an interesting discussion. This is very enlightening.
> 
> Thank you Matthias for your explanation. Your arguments are very true. It makes sense that if user specifies number of partitions he/she really cares that those specifications are applied to internal topics.
> Unfortunately, in current implementation this is not true during `join` operation. As I’ve written in the PR comment, currently, when `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of partitions from the two source topics.
> I’m not really sure what would be the other way around this situation. Maybe fail the stream altogether and inform the user to specify same number of partitions? 
> Or we should treat join operations in a same way as it is right now and basically choose max number of partitions even when `repartition` operation is specified, because Kafka Streams “knows the best” how to handle joins?
> You can check integration tests how it’s being handled currently. Open to suggestions on that part.
> 
> As for groupBy, I agree and John raised very interesting points. My arguments for allowing users to specify number of partitions during groupBy operations mainly was coming from the usability perspective.
> So building on top of what John said, maybe it makes sense to make `groupBy` operations smarter and whenever user specifies `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
> I’m not going into optimization part yet :) I think it will be part of separate PR and task, but overall it makes sense to apply optimizations where number of partitions are the same.
> 
> As for Repartitioned extending Grouped, I kinda feel that it won’t fit nicely in current API design. 
> In addition, in the PR review, John mentioned that there were a lot of troubles in the past trying to use one operation's configuration objects on other operations. 
> Also it makes sense to keep them separate in terms of compatibility. 
> In that case, we don’t have to worry every time Grouped is changed, what would be the implications on `repartition` operations.
> 
> Kind regards,
> Levani
> 
> 
>> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
>>
>> Ah, thanks for the clarification. I missed your point.
>>
>> I like the framework you've presented. It does seem simpler to assume
>> that they either care about the partition count and want to
>> repartition to realize it, or they don't care about the number.
>> Returning to this discussion, it does seem unlikely that they care
>> about the number and _don't_ care if it actually gets realized.
>>
>> But then, it still seems like we can just keep the option as part of
>> Grouped. As in:
>>
>> // user does not care
>> stream.groupByKey(Grouped /*not specifying partition count*/)
>> stream.groupBy(Grouped /*not specifying partition count*/)
>>
>> // user does care
>> stream.repartition(Repartitioned)
>> stream.groupByKey(Grouped.numberOfPartitions(...))
>> stream.groupBy(Grouped.numberOfPartitions(...))
>>
>> ----
>>
>> The above discussion got me thinking about algebra. Matthias is
>> absolutely right that `groupByKey(numPartitions)` is equivalent to
>> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
>> we should force people to apply that expansion themselves vs. having a
>> more compact way to express it if they don't care where exactly the
>> repartition occurs. However, thinking about these operators
>> algebraically can really help *us* narrow down the number of different
>> expressions we have to consider.
>>
>> Let's consider some identities:
>>
>> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
>> B: src + ... + groupByKey + agg = src + ... + passthough + agg
>> C: mapKey(mapper) + ... + groupByKey + agg
>> = mapKey(mapper) + ... + repartition + groupByKey + agg
>> D: repartition = sink(managed) + src
>>
>> In these identities, I used one special identifier (...), which means
>> any number (0+) of operations that are not src, mapKey, groupBy[Key],
>> repartition, or agg.
>>
>> For mental clarity, I'm just going to make up a rule that groupBy
>> operations are not executable. In other words, you have to get to a
>> point where you can apply B to convert a groupByKey into a passthough
>> in order to execute the program. This is just a formal way of stating
>> what already happens in Kafka Streams.
>>
>> By applying A, we can just completely leave `groupBy` out of our
>> analysis. It trivially decomposes into a mapKey followed by a
>> groupByKey.
>>
>> Then, we can eliminate the "repartition required" case of `groupByKey`
>> by applying C followed by D to get to the "no repartition required"
>> version of groupByKey, which in turn sets us up to apply B to get an
>> executable topology.
>>
>> Fundamentally, you can think about KIP-221 is as proposing a modified
>> D identity in which you can specify the partition count of the managed
>> sink topic:
>> D': repartition(pc) = sink(managed w/ pc) + src
>>
>> Since users _could_ apply the identities above, we don't actually have
>> to add any partition count to groupBy[Key], but we decided early on in
>> the KIP discussion that it's more ergonomic to add it. In that case,
>> we also have to modify A and C:
>> A': groupBy(mapper, pc) + agg
>> = mapKey(mapper) + groupByKey(pc) + agg
>> C': mapKey(mapper) + ... + groupByKey(pc) + agg
>> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
>>
>> Which sets us up still to always be able to get back to a plain
>> `groupByKey` operation (with no `(pc)`) and then apply D' and
>> ultimately B to get an executable topology.
>>
>> What about the optimizer?
>> The optimizer applies another set of graph-algebraic identities to
>> minimize the number of repartition topics in a topology.
>>
>> (forgive my ascii art)
>>
>> E: (merging repartition nodes)
>> (...) -> repartition -> X
>>  \-> repartition -> Y
>> =
>> (... + repartition) -> X
>>     \-> Y
>> F: (reordering around repartition)
>> Where SVO is any non-key-changing, stateless, operation:
>> repartition -> SVO = SVO -> repartition
>>
>> In terms of these identities, what the optimizer does is apply F
>> repeatedly in either direction to a topology to factor out common in
>> branches so that it can apply E to merge repartition nodes. This was
>> especially necessary before KIP-221 because you couldn't directly
>> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
>> so there was no way to do the factoring manually.
>>
>> We can now state very clearly that in KIP-221, explicit
>> `repartition()` operators should create a "reordering barrier". So, F
>> cannot be applied to an explicit `repartition()`. Also, I think we
>> decided earlier that explicit `repartition()` operations would also be
>> ineligible for merging, so E can't be applied to explicit
>> `repartition()` operations either. I think we feel we _could_ apply E
>> without harm, but we want to be conservative for now.
>>
>> I think the salient point from the latter discussion has been that
>> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
>> an explicit `repartition()` operator, and therefore, the resulting
>> repartition node remains eligible for optimization.
>>
>> To be clear, I agree with Matthias that the provided partition count
>> _must_ be used in the resulting implicit repartition. This has some
>> implications for E. Namely, E could only be applied to two repartition
>> nodes that have the same partition count. This has always been
>> trivially true before KIP-221 because the partition count has always
>> been "unspecified", i.e., it would be determined at runtime by the
>> user-managed-topics' partition counts. Now, it could be specified or
>> unspecified. We can simply augment E to allow merging only repartition
>> nodes where the partition count is EITHER "specified and the same on
>> both sides", OR "unspecified on both sides".
>>
>> Sorry for the long email, but I have a hope that it builds a solid
>> theoretical foundation for our decisions in KIP-221, so we can have
>> confidence that there are no edge cases for design flaws to hide.
>>
>> Thanks,
>> -John
>>
>> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <ma...@confluent.io> wrote:
>>>
>>>> it seems like we do want to allow
>>>>> people to optionally specify a partition count as part of this
>>>>> operation, but we don't want that option to _force_ repartitioning
>>>
>>> Correct, ie, that is my suggestions.
>>>
>>>> "Use P partitions if repartitioning is necessary"
>>>
>>> I disagree here, because my reasoning is that:
>>>
>>> - if a user cares about the number of partition, the user wants to
>>> enforce a repartitioning
>>> - if a user does not case about the number of partitions, we don't need
>>> to provide them a way to pass in a "hint"
>>>
>>> Hence, it should be sufficient to support:
>>>
>>> // user does not care
>>>
>>>  `stream.groupByKey(Grouped)`
>>>  `stream.grouBy(..., Grouped)`
>>>
>>> // user does care
>>>
>>>  `stream.repartition(Repartitioned).groupByKey()`
>>>  `streams.groupBy(..., Repartitioned)`
>>>
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 11/9/19 8:10 PM, John Roesler wrote:
>>>> Thanks for those thoughts, Matthias,
>>>>
>>>> I find your reasoning about the optimization behavior compelling. The
>>>> `through` operation is very simple and clear to reason about. It just
>>>> passes the data exactly at the specified point in the topology exactly
>>>> through the specified topic. Likewise, if a user invokes a
>>>> `repartition` operator, the simplest behavior is if we just do what
>>>> they asked for.
>>>>
>>>> Stepping back to think about when optimizations are surprising and
>>>> when they aren't, it occurs to me that we should be free to move
>>>> around repartitions when users have asked to perform some operation
>>>> that implies a repartition, like "change keys, then filter, then
>>>> aggregate". This program requires a repartition, but it could be
>>>> anywhere between the key change and the aggregation. On the other
>>>> hand, if they say, "change keys, then filter, then repartition, then
>>>> aggregate", it seems like they were pretty clear about their desire,
>>>> and we should just take it at face value.
>>>>
>>>> So, I'm sold on just literally doing a repartition every time they
>>>> invoke the `repartition` operator.
>>>>
>>>>
>>>> The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
>>>>
>>>> What you said about `groupByKey` makes sense. If they key hasn't
>>>> actually changed, then we don't need to repartition before
>>>> aggregating. On the other hand, `groupBy` is specifically changing the
>>>> key as part of the grouping operation, so (as you said) we definitely
>>>> have to do a repartition.
>>>>
>>>> If I'm reading the discussion right, it seems like we do want to allow
>>>> people to optionally specify a partition count as part of this
>>>> operation, but we don't want that option to _force_ repartitioning if
>>>> it's not needed. That last clause is the key. "Use P partitions if
>>>> repartitioning is necessary" is a directive that applies cleanly and
>>>> correctly to both `groupBy` and `groupByKey`. What if we call the
>>>> option `numberOfPartitionsHint`, which along with the "if necessary"
>>>> javadoc, should make it clear that the option won't force a
>>>> repartition, and also gives us enough latitude to still employ the
>>>> optimizer on those repartition topics?
>>>>
>>>> If we like the idea of expressing it as a "hint" for grouping and a
>>>> "command" for `repartition`, then it seems like it still makes sense
>>>> to keep Grouped and Repartitioned separate, as they would actually
>>>> offer different methods with distinct semantics.
>>>>
>>>> WDYT?
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
>>>>>
>>>>> Sorry for late reply.
>>>>>
>>>>> I guess, the question boils down to the intended semantics of
>>>>> `repartition()`. My understanding is as follows:
>>>>>
>>>>> - KS does auto-repartitioning for correctness reasons (using the
>>>>> upstream topic to determine the number of partitions)
>>>>> - KS does auto-repartitioning only for downstream DSL operators like
>>>>> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
>>>>> even if the stream is marked as `repartitioningRequired`).
>>>>> - KS offers `through()` to enforce a repartitioning -- however, the user
>>>>> needs to create the topic manually (with the desired number of partitions).
>>>>>
>>>>> I see two main applications for `repartitioning()`:
>>>>>
>>>>> 1) repartition data before a `transform()` but user does not want to
>>>>> manage the topic
>>>>> 2) scale out a downstream subtopology
>>>>>
>>>>> Hence, I see `repartition()` similar to `through()`: if a users calls
>>>>> it, a repartitining is enforced, with the difference that KS manages the
>>>>> topic and the user does not need to create it.
>>>>>
>>>>> This behavior makes (1) and (2) possible.
>>>>>
>>>>>> I think many users would prefer to just say "if there *is* a repartition
>>>>>> required at this point in the topology, it should
>>>>>> have N partitions"
>>>>>
>>>>> Because of (2), I disagree. Either a user does not care about scaling
>>>>> out, for which case she would not specify the number of partitions. Or a
>>>>> user does care, and hence wants to enforce the scale out. I don't think
>>>>> that any user would say, "maybe scale out".
>>>>>
>>>>> Therefore, the optimizer should never ignore the repartition operation.
>>>>> As a "consequence" (because repartitioning is expensive) a user should
>>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
>>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>>>>> because it might be too subtle and an "optional scaling out" as laid out
>>>>> above does not make sense IMHO.
>>>>>
>>>>> I am also not worried about "over repartitioning" because the result
>>>>> stream would never trigger auto-repartitioning. Only if multiple
>>>>> consecutive calls to `repartition()` are made it could be bad -- but
>>>>> that's the same with `through()`. In the end, there is always some
>>>>> responsibility on the user.
>>>>>
>>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
>>>>> however, for `groupByKey()` it depends if the KStream is marked as
>>>>> `repartitioningRequired`.
>>>>>
>>>>> Hence, for `groupByKey()` it should not be possible for a user to set
>>>>> number of partitions IMHO. For `groupBy()` it's a different story,
>>>>> because calling
>>>>>
>>>>>   `repartition().groupBy()`
>>>>>
>>>>> does not achieve what we want. Hence, allowing users to pass in the
>>>>> number of users partitions into `groupBy()` does actually makes sense,
>>>>> because repartitioning will happen anyway and thus we can piggyback a
>>>>> scaling decision.
>>>>>
>>>>> I think that John has a fair concern about the overloads, however, I am
>>>>> not convinced that using `Grouped` to specify the number of partitions
>>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and both
>>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>>>>> wondering if we could bridge the gap between both, if we would make
>>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>>> `groupBy(Grouped)` and a user can pass in both types what seems to make
>>>>> the API quite smooth:
>>>>>
>>>>>  `stream.groupBy(..., Grouped...)`
>>>>>
>>>>>  `stream.groupBy(..., Repartitioned...)`
>>>>>
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>>
>>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>>> Hi Sophie,
>>>>>>
>>>>>> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
>>>>>>
>>>>>> Kind regards,
>>>>>> Levani
>>>>>>
>>>>>>
>>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>>>> Kafka Streams has already
>>>>>>>> optimizer in place which alters topology independently from user
>>>>>>>
>>>>>>> I agree (with you) and think this is a good way to put it -- we currently
>>>>>>> auto-repartition for the user so
>>>>>>> that they don't have to walk through their entire topology and reason about
>>>>>>> when and where to place a
>>>>>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>>>>>>> user? How certain are we that
>>>>>>> users will always get this right? It's easy to imagine that during
>>>>>>> development, you write your new app with
>>>>>>> correctly placed repartitions in order to use this new feature. During the
>>>>>>> course of development you end up
>>>>>>> tweaking the topology, but don't remember to review or move the
>>>>>>> repartitioning since you're used to Streams
>>>>>>> doing this for you. If you use only single-partition topics for testing,
>>>>>>> you might not even notice your app is
>>>>>>> spitting out incorrect results!
>>>>>>>
>>>>>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>>>>>>> feature and say that to use it, you can't take
>>>>>>> advantage of this other feature anymore. Also, is it possible our
>>>>>>> optimization framework could ever include an
>>>>>>> optimized repartitioning strategy that is better than what a user could
>>>>>>> achieve by manually inserting repartitions?
>>>>>>> Do we expect users to have a deep understanding of the best way to
>>>>>>> repartition their particular topology, or is it
>>>>>>> likely they will end up over-repartitioning either due to missed
>>>>>>> optimizations or unnecessary extra repartitions?
>>>>>>> I think many users would prefer to just say "if there *is* a repartition
>>>>>>> required at this point in the topology, it should
>>>>>>> have N partitions"
>>>>>>>
>>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>>>>>>> adding a new parameter to groupBy, that does seem more in line with the
>>>>>>> current syntax so +1 from me
>>>>>>>
>>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>>>>>>> almost done, I want to resurrect discussion about this KIP to address
>>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>>>
>>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>>> 1) Introducing new KStream#repartition operation which is done in
>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>>>>>>> be separate PR.
>>>>>>>>
>>>>>>>> While all agree about general implementation and idea behind
>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>>>>> KStream#groupBy(Repartitioned) method overload raised some questions during
>>>>>>>> the review.
>>>>>>>> Matthias raised concern that there can be cases when user uses
>>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>>>>>>> not required, thus configuration passed via `Repartitioned` would never be
>>>>>>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>>>>>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>>>>>>> she should always use `KStream#repartition` operation before groupBy. Full
>>>>>>>> comment can be seen here:
>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>>>
>>>>>>>> On the same topic, John pointed out that, from API design perspective, we
>>>>>>>> shouldn’t intertwine configuration classes of different operators between
>>>>>>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>>>>>>>> for specifying number of partitions for internal topic, we should update
>>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>>>
>>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>>>>> Kafka Streams has already optimizer in place which alters topology
>>>>>>>> independently from user. So maybe it makes sense if Kafka Streams,
>>>>>>>> internally would optimize topology in the best way possible, even if in
>>>>>>>> some cases this means ignoring some operator configurations passed by the
>>>>>>>> user. Also, I agree with John about API design semantics. If we go through
>>>>>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>>>
>>>>>>>> I would really appreciate communities feedback on this.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Levani
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hey Levani,
>>>>>>>>>
>>>>>>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>>>>>>> much
>>>>>>>>> spare time at the
>>>>>>>>> moment. It's kind of a difficult time to get attention on things, but
>>>>>>>> feel
>>>>>>>>> free to pick up something else
>>>>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Sophie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>>>>>>
>>>>>>>>>>>> Updated KIP
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
>>>>>>>> <ma...@confluent.io>
>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> One question:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Why do we add
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>>>>>>> be
>>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>>>>>>> possible.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>>>>>>> configs. My
>>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>>>>>>> helps
>>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>>>>> abbreviations.
>>>>>>>>>> Can
>>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>>>>>>> have
>>>>>>>>>> some duplicate configurations between operators in order to keep API
>>>>>>>> simple
>>>>>>>>>> and usable.
>>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>>>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>>>>>>> separate.
>>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>>>>>>> true to
>>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>>>>>>> split out the
>>>>>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>>>>>>> topic and
>>>>>>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>>>>>>> API surface
>>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>>>>>>> is
>>>>>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>>>>>>> about what
>>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
>>>>>>>> <ma...@gmail.com>
>>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>>>>>>> key
>>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>>>>>>> rationale
>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
>>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>>>>>>> john@confluent.io>
>>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>>>>>>> not
>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>>>>>>> topics
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>>>>>>> be
>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>>>>>>> is
>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>>>>>>> --
>>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>>>>>>> for
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> <
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>
>>>>>>>>>> <
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>>>>>>> about
>>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>>>>>>> as
>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>>>>>>> to
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>>>>>>> operators
>>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>>>>>>> indicate
>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>>>>>>> to
>>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>>>>>>> configs,
>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>>>>>>> enhancing
>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>>>>> assumption
>>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>>>>>>> topics
>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>>>>> Considering
>>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>>>>> will
>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>>>>>>> new method
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>>>>> Grouped
>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>>>>> will
>>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>>>>> Grouped
>>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>>> levani.codes@gmail.com>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>>>>>>> control
>>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
> 
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Levani Kokhreidze <le...@gmail.com>.
Thank you all for an interesting discussion. This is very enlightening.

Thank you Matthias for your explanation. Your arguments are very true. It makes sense that if user specifies number of partitions he/she really cares that those specifications are applied to internal topics.
Unfortunately, in current implementation this is not true during `join` operation. As I’ve written in the PR comment, currently, when `Stream#join` is used, `CopartitionedTopicsEnforcer` chooses max number of partitions from the two source topics.
I’m not really sure what would be the other way around this situation. Maybe fail the stream altogether and inform the user to specify same number of partitions? 
Or we should treat join operations in a same way as it is right now and basically choose max number of partitions even when `repartition` operation is specified, because Kafka Streams “knows the best” how to handle joins?
You can check integration tests how it’s being handled currently. Open to suggestions on that part.

As for groupBy, I agree and John raised very interesting points. My arguments for allowing users to specify number of partitions during groupBy operations mainly was coming from the usability perspective.
So building on top of what John said, maybe it makes sense to make `groupBy` operations smarter and whenever user specifies `numberOfPartitions` configuration, repartitioning will be enforced, wdyt?
I’m not going into optimization part yet :) I think it will be part of separate PR and task, but overall it makes sense to apply optimizations where number of partitions are the same.

As for Repartitioned extending Grouped, I kinda feel that it won’t fit nicely in current API design. 
In addition, in the PR review, John mentioned that there were a lot of troubles in the past trying to use one operation's configuration objects on other operations. 
Also it makes sense to keep them separate in terms of compatibility. 
In that case, we don’t have to worry every time Grouped is changed, what would be the implications on `repartition` operations.

Kind regards,
Levani


> On Nov 11, 2019, at 9:13 PM, John Roesler <jo...@confluent.io> wrote:
> 
> Ah, thanks for the clarification. I missed your point.
> 
> I like the framework you've presented. It does seem simpler to assume
> that they either care about the partition count and want to
> repartition to realize it, or they don't care about the number.
> Returning to this discussion, it does seem unlikely that they care
> about the number and _don't_ care if it actually gets realized.
> 
> But then, it still seems like we can just keep the option as part of
> Grouped. As in:
> 
> // user does not care
> stream.groupByKey(Grouped /*not specifying partition count*/)
> stream.groupBy(Grouped /*not specifying partition count*/)
> 
> // user does care
> stream.repartition(Repartitioned)
> stream.groupByKey(Grouped.numberOfPartitions(...))
> stream.groupBy(Grouped.numberOfPartitions(...))
> 
> ----
> 
> The above discussion got me thinking about algebra. Matthias is
> absolutely right that `groupByKey(numPartitions)` is equivalent to
> `repartition(numPartitions).groupByKey()`. I'm just not convinced that
> we should force people to apply that expansion themselves vs. having a
> more compact way to express it if they don't care where exactly the
> repartition occurs. However, thinking about these operators
> algebraically can really help *us* narrow down the number of different
> expressions we have to consider.
> 
> Let's consider some identities:
> 
> A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
> B: src + ... + groupByKey + agg = src + ... + passthough + agg
> C: mapKey(mapper) + ... + groupByKey + agg
> = mapKey(mapper) + ... + repartition + groupByKey + agg
> D: repartition = sink(managed) + src
> 
> In these identities, I used one special identifier (...), which means
> any number (0+) of operations that are not src, mapKey, groupBy[Key],
> repartition, or agg.
> 
> For mental clarity, I'm just going to make up a rule that groupBy
> operations are not executable. In other words, you have to get to a
> point where you can apply B to convert a groupByKey into a passthough
> in order to execute the program. This is just a formal way of stating
> what already happens in Kafka Streams.
> 
> By applying A, we can just completely leave `groupBy` out of our
> analysis. It trivially decomposes into a mapKey followed by a
> groupByKey.
> 
> Then, we can eliminate the "repartition required" case of `groupByKey`
> by applying C followed by D to get to the "no repartition required"
> version of groupByKey, which in turn sets us up to apply B to get an
> executable topology.
> 
> Fundamentally, you can think about KIP-221 is as proposing a modified
> D identity in which you can specify the partition count of the managed
> sink topic:
> D': repartition(pc) = sink(managed w/ pc) + src
> 
> Since users _could_ apply the identities above, we don't actually have
> to add any partition count to groupBy[Key], but we decided early on in
> the KIP discussion that it's more ergonomic to add it. In that case,
> we also have to modify A and C:
> A': groupBy(mapper, pc) + agg
> = mapKey(mapper) + groupByKey(pc) + agg
> C': mapKey(mapper) + ... + groupByKey(pc) + agg
> = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg
> 
> Which sets us up still to always be able to get back to a plain
> `groupByKey` operation (with no `(pc)`) and then apply D' and
> ultimately B to get an executable topology.
> 
> What about the optimizer?
> The optimizer applies another set of graph-algebraic identities to
> minimize the number of repartition topics in a topology.
> 
> (forgive my ascii art)
> 
> E: (merging repartition nodes)
> (...) -> repartition -> X
>  \-> repartition -> Y
> =
> (... + repartition) -> X
>     \-> Y
> F: (reordering around repartition)
> Where SVO is any non-key-changing, stateless, operation:
> repartition -> SVO = SVO -> repartition
> 
> In terms of these identities, what the optimizer does is apply F
> repeatedly in either direction to a topology to factor out common in
> branches so that it can apply E to merge repartition nodes. This was
> especially necessary before KIP-221 because you couldn't directly
> express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
> so there was no way to do the factoring manually.
> 
> We can now state very clearly that in KIP-221, explicit
> `repartition()` operators should create a "reordering barrier". So, F
> cannot be applied to an explicit `repartition()`. Also, I think we
> decided earlier that explicit `repartition()` operations would also be
> ineligible for merging, so E can't be applied to explicit
> `repartition()` operations either. I think we feel we _could_ apply E
> without harm, but we want to be conservative for now.
> 
> I think the salient point from the latter discussion has been that
> when you use `Grouped.numberOfPartitions`, this does _not_ constitute
> an explicit `repartition()` operator, and therefore, the resulting
> repartition node remains eligible for optimization.
> 
> To be clear, I agree with Matthias that the provided partition count
> _must_ be used in the resulting implicit repartition. This has some
> implications for E. Namely, E could only be applied to two repartition
> nodes that have the same partition count. This has always been
> trivially true before KIP-221 because the partition count has always
> been "unspecified", i.e., it would be determined at runtime by the
> user-managed-topics' partition counts. Now, it could be specified or
> unspecified. We can simply augment E to allow merging only repartition
> nodes where the partition count is EITHER "specified and the same on
> both sides", OR "unspecified on both sides".
> 
> Sorry for the long email, but I have a hope that it builds a solid
> theoretical foundation for our decisions in KIP-221, so we can have
> confidence that there are no edge cases for design flaws to hide.
> 
> Thanks,
> -John
> 
> On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <ma...@confluent.io> wrote:
>> 
>>> it seems like we do want to allow
>>>> people to optionally specify a partition count as part of this
>>>> operation, but we don't want that option to _force_ repartitioning
>> 
>> Correct, ie, that is my suggestions.
>> 
>>> "Use P partitions if repartitioning is necessary"
>> 
>> I disagree here, because my reasoning is that:
>> 
>> - if a user cares about the number of partition, the user wants to
>> enforce a repartitioning
>> - if a user does not case about the number of partitions, we don't need
>> to provide them a way to pass in a "hint"
>> 
>> Hence, it should be sufficient to support:
>> 
>> // user does not care
>> 
>>  `stream.groupByKey(Grouped)`
>>  `stream.grouBy(..., Grouped)`
>> 
>> // user does care
>> 
>>  `stream.repartition(Repartitioned).groupByKey()`
>>  `streams.groupBy(..., Repartitioned)`
>> 
>> 
>> 
>> -Matthias
>> 
>> 
>> On 11/9/19 8:10 PM, John Roesler wrote:
>>> Thanks for those thoughts, Matthias,
>>> 
>>> I find your reasoning about the optimization behavior compelling. The
>>> `through` operation is very simple and clear to reason about. It just
>>> passes the data exactly at the specified point in the topology exactly
>>> through the specified topic. Likewise, if a user invokes a
>>> `repartition` operator, the simplest behavior is if we just do what
>>> they asked for.
>>> 
>>> Stepping back to think about when optimizations are surprising and
>>> when they aren't, it occurs to me that we should be free to move
>>> around repartitions when users have asked to perform some operation
>>> that implies a repartition, like "change keys, then filter, then
>>> aggregate". This program requires a repartition, but it could be
>>> anywhere between the key change and the aggregation. On the other
>>> hand, if they say, "change keys, then filter, then repartition, then
>>> aggregate", it seems like they were pretty clear about their desire,
>>> and we should just take it at face value.
>>> 
>>> So, I'm sold on just literally doing a repartition every time they
>>> invoke the `repartition` operator.
>>> 
>>> 
>>> The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
>>> 
>>> What you said about `groupByKey` makes sense. If they key hasn't
>>> actually changed, then we don't need to repartition before
>>> aggregating. On the other hand, `groupBy` is specifically changing the
>>> key as part of the grouping operation, so (as you said) we definitely
>>> have to do a repartition.
>>> 
>>> If I'm reading the discussion right, it seems like we do want to allow
>>> people to optionally specify a partition count as part of this
>>> operation, but we don't want that option to _force_ repartitioning if
>>> it's not needed. That last clause is the key. "Use P partitions if
>>> repartitioning is necessary" is a directive that applies cleanly and
>>> correctly to both `groupBy` and `groupByKey`. What if we call the
>>> option `numberOfPartitionsHint`, which along with the "if necessary"
>>> javadoc, should make it clear that the option won't force a
>>> repartition, and also gives us enough latitude to still employ the
>>> optimizer on those repartition topics?
>>> 
>>> If we like the idea of expressing it as a "hint" for grouping and a
>>> "command" for `repartition`, then it seems like it still makes sense
>>> to keep Grouped and Repartitioned separate, as they would actually
>>> offer different methods with distinct semantics.
>>> 
>>> WDYT?
>>> 
>>> Thanks,
>>> -John
>>> 
>>> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
>>>> 
>>>> Sorry for late reply.
>>>> 
>>>> I guess, the question boils down to the intended semantics of
>>>> `repartition()`. My understanding is as follows:
>>>> 
>>>> - KS does auto-repartitioning for correctness reasons (using the
>>>> upstream topic to determine the number of partitions)
>>>> - KS does auto-repartitioning only for downstream DSL operators like
>>>> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
>>>> even if the stream is marked as `repartitioningRequired`).
>>>> - KS offers `through()` to enforce a repartitioning -- however, the user
>>>> needs to create the topic manually (with the desired number of partitions).
>>>> 
>>>> I see two main applications for `repartitioning()`:
>>>> 
>>>> 1) repartition data before a `transform()` but user does not want to
>>>> manage the topic
>>>> 2) scale out a downstream subtopology
>>>> 
>>>> Hence, I see `repartition()` similar to `through()`: if a users calls
>>>> it, a repartitining is enforced, with the difference that KS manages the
>>>> topic and the user does not need to create it.
>>>> 
>>>> This behavior makes (1) and (2) possible.
>>>> 
>>>>> I think many users would prefer to just say "if there *is* a repartition
>>>>> required at this point in the topology, it should
>>>>> have N partitions"
>>>> 
>>>> Because of (2), I disagree. Either a user does not care about scaling
>>>> out, for which case she would not specify the number of partitions. Or a
>>>> user does care, and hence wants to enforce the scale out. I don't think
>>>> that any user would say, "maybe scale out".
>>>> 
>>>> Therefore, the optimizer should never ignore the repartition operation.
>>>> As a "consequence" (because repartitioning is expensive) a user should
>>>> make an explicit call to `repartition()` IMHO -- piggybacking an
>>>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>>>> because it might be too subtle and an "optional scaling out" as laid out
>>>> above does not make sense IMHO.
>>>> 
>>>> I am also not worried about "over repartitioning" because the result
>>>> stream would never trigger auto-repartitioning. Only if multiple
>>>> consecutive calls to `repartition()` are made it could be bad -- but
>>>> that's the same with `through()`. In the end, there is always some
>>>> responsibility on the user.
>>>> 
>>>> Btw, for `.groupBy()` we know that repartitioning will be required,
>>>> however, for `groupByKey()` it depends if the KStream is marked as
>>>> `repartitioningRequired`.
>>>> 
>>>> Hence, for `groupByKey()` it should not be possible for a user to set
>>>> number of partitions IMHO. For `groupBy()` it's a different story,
>>>> because calling
>>>> 
>>>>   `repartition().groupBy()`
>>>> 
>>>> does not achieve what we want. Hence, allowing users to pass in the
>>>> number of users partitions into `groupBy()` does actually makes sense,
>>>> because repartitioning will happen anyway and thus we can piggyback a
>>>> scaling decision.
>>>> 
>>>> I think that John has a fair concern about the overloads, however, I am
>>>> not convinced that using `Grouped` to specify the number of partitions
>>>> is intuitive. I double checked `Grouped` and `Repartitioned` and both
>>>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>>>> wondering if we could bridge the gap between both, if we would make
>>>> `Repartitioned extends Grouped`? For this case, we only need
>>>> `groupBy(Grouped)` and a user can pass in both types what seems to make
>>>> the API quite smooth:
>>>> 
>>>>  `stream.groupBy(..., Grouped...)`
>>>> 
>>>>  `stream.groupBy(..., Repartitioned...)`
>>>> 
>>>> 
>>>> Thoughts?
>>>> 
>>>> 
>>>> -Matthias
>>>> 
>>>> 
>>>> 
>>>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>>>> Hi Sophie,
>>>>> 
>>>>> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
>>>>> 
>>>>> Kind regards,
>>>>> Levani
>>>>> 
>>>>> 
>>>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
>>>>>> 
>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>>> Kafka Streams has already
>>>>>>> optimizer in place which alters topology independently from user
>>>>>> 
>>>>>> I agree (with you) and think this is a good way to put it -- we currently
>>>>>> auto-repartition for the user so
>>>>>> that they don't have to walk through their entire topology and reason about
>>>>>> when and where to place a
>>>>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>>>>>> user? How certain are we that
>>>>>> users will always get this right? It's easy to imagine that during
>>>>>> development, you write your new app with
>>>>>> correctly placed repartitions in order to use this new feature. During the
>>>>>> course of development you end up
>>>>>> tweaking the topology, but don't remember to review or move the
>>>>>> repartitioning since you're used to Streams
>>>>>> doing this for you. If you use only single-partition topics for testing,
>>>>>> you might not even notice your app is
>>>>>> spitting out incorrect results!
>>>>>> 
>>>>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>>>>>> feature and say that to use it, you can't take
>>>>>> advantage of this other feature anymore. Also, is it possible our
>>>>>> optimization framework could ever include an
>>>>>> optimized repartitioning strategy that is better than what a user could
>>>>>> achieve by manually inserting repartitions?
>>>>>> Do we expect users to have a deep understanding of the best way to
>>>>>> repartition their particular topology, or is it
>>>>>> likely they will end up over-repartitioning either due to missed
>>>>>> optimizations or unnecessary extra repartitions?
>>>>>> I think many users would prefer to just say "if there *is* a repartition
>>>>>> required at this point in the topology, it should
>>>>>> have N partitions"
>>>>>> 
>>>>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>>>>>> adding a new parameter to groupBy, that does seem more in line with the
>>>>>> current syntax so +1 from me
>>>>>> 
>>>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>>>>>> almost done, I want to resurrect discussion about this KIP to address
>>>>>>> couple of concerns raised by Matthias and John.
>>>>>>> 
>>>>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>>>>>> repartitioning and parallelism of sub-topologies by:
>>>>>>> 1) Introducing new KStream#repartition operation which is done in
>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>> https://github.com/apache/kafka/pull/7170>
>>>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>>>>>> be separate PR.
>>>>>>> 
>>>>>>> While all agree about general implementation and idea behind
>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>>>> KStream#groupBy(Repartitioned) method overload raised some questions during
>>>>>>> the review.
>>>>>>> Matthias raised concern that there can be cases when user uses
>>>>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>>>>>> not required, thus configuration passed via `Repartitioned` would never be
>>>>>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>>>>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>>>>>> she should always use `KStream#repartition` operation before groupBy. Full
>>>>>>> comment can be seen here:
>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>>> 
>>>>>>> On the same topic, John pointed out that, from API design perspective, we
>>>>>>> shouldn’t intertwine configuration classes of different operators between
>>>>>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>>>>>>> for specifying number of partitions for internal topic, we should update
>>>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>>> 
>>>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>>>> Kafka Streams has already optimizer in place which alters topology
>>>>>>> independently from user. So maybe it makes sense if Kafka Streams,
>>>>>>> internally would optimize topology in the best way possible, even if in
>>>>>>> some cases this means ignoring some operator configurations passed by the
>>>>>>> user. Also, I agree with John about API design semantics. If we go through
>>>>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>>>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>>> 
>>>>>>> I would really appreciate communities feedback on this.
>>>>>>> 
>>>>>>> Kind regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey Levani,
>>>>>>>> 
>>>>>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>>>>>> much
>>>>>>>> spare time at the
>>>>>>>> moment. It's kind of a difficult time to get attention on things, but
>>>>>>> feel
>>>>>>>> free to pick up something else
>>>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Sophie
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hello all,
>>>>>>>>> 
>>>>>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>>>>>> Please let me know if you have any concerns.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> Here’s voting thread for this KIP:
>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>>>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Levani
>>>>>>>>>> 
>>>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>>>>> 
>>>>>>>>>>> Updated KIP
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
>>>>>>> <ma...@confluent.io>
>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> One question:
>>>>>>>>>>>> 
>>>>>>>>>>>> Why do we add
>>>>>>>>>>>> 
>>>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>>>>> 
>>>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>>>>>> be
>>>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>>> 
>>>>>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>>>>>> possible.)
>>>>>>>>>>>> 
>>>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> -Matthias
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>>>> ).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>>>>>> strategy.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>>>>>> configs. My
>>>>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>>>>>> `Produced`,
>>>>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>>>>>> helps
>>>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>>>> abbreviations.
>>>>>>>>> Can
>>>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>>>>>> have
>>>>>>>>> some duplicate configurations between operators in order to keep API
>>>>>>> simple
>>>>>>>>> and usable.
>>>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>>>>>> backward compatibility.
>>>>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>>>>>> separate.
>>>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>>>>>> configuration. But
>>>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>>>>>> configured
>>>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>>>>>> writing to the
>>>>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>>>>>> sense, these
>>>>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>>>>>> true to
>>>>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>>>>>> wrapped into a
>>>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>>>>>> split out the
>>>>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>>>>>> topic and
>>>>>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>>>>>> API surface
>>>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>>>>>> is
>>>>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>>>>>> about what
>>>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>>>>>> Repartitioned,
>>>>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>>>>>> Streams at
>>>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>>>> configurations (such
>>>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>>>>>> couldn’t really
>>>>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>>>>>> class and just
>>>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>>>>>> semantics.
>>>>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>>>>>> and
>>>>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>>>>>> thing to do.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
>>>>>>> <ma...@gmail.com>
>>>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>>>>>> the reason
>>>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>>>>>> partitioning
>>>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>>>>>> key
>>>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>>>>>> the key
>>>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>>>>>> in
>>>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>>>>>> consistency.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>>>>>> with the
>>>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>>>>>> topics, user
>>>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>>>>>> operation.
>>>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>>>>>> sure it
>>>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>>>>>> rationale
>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>>>>>> first issue
>>>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>>>>>> essentially
>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
>>>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>>>>>> john@confluent.io>
>>>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>>>>>> not
>>>>>>>>> sure
>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>>>>>> need control
>>>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>>>>>> how to
>>>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>>>>>> topics
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>>>>>> specify
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>>>>>> specifying the
>>>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>>>>>> would be a
>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>>>>>> be
>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>>>>>> the topic
>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>>>>>> is
>>>>>>>>> missing
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>>>>>> configuring
>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>>>>>> --
>>>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>>>>>> will the
>>>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>>>>>> for
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>> <
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>> 
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>>>>>> about
>>>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>>>>>> confluence
>>>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>>>>>> class for
>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>>>>>> as
>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>>>>>> already (can't
>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>>>>>> to
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>>>>>> operators
>>>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>>>>>> indicate
>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>>>>>> use-defined
>>>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>>>>>> operator that
>>>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>>>>>> one
>>>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>>>>>> sometimes
>>>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>>>>>> wondering
>>>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>>>>>> others on
>>>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>>>>>> to
>>>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>>>>>> assume that
>>>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>>>>>> instead of
>>>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>>>>>> configs,
>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>>>>>> them
>>>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>>>>>> enhancing
>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>>>>>> Let me
>>>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>>>>>> configuration,
>>>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>>>> assumption
>>>>>>>>> that user
>>>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>>>>>> case, having
>>>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>>>> configurations
>>>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>>>>>> topics
>>>>>>>>> created
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>>>>>> should leave
>>>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>>>>>> create and
>>>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>>>>>> possibility to
>>>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>>>> Considering
>>>>>>>>> arguments
>>>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>>>>>> options
>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>>>> will
>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>>>>>> new method
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>>>>>> in one of
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>>>>>> following
>>>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>>>> Grouped
>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>>>> will
>>>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>>>> Grouped
>>>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>>>>>> creating and
>>>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>>>>>> about this
>>>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>>>> levani.codes@gmail.com>>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>>>>>> understood, the
>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>>>>>> method
>>>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>>>>>> control
>>>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by John Roesler <jo...@confluent.io>.
Ah, thanks for the clarification. I missed your point.

I like the framework you've presented. It does seem simpler to assume
that they either care about the partition count and want to
repartition to realize it, or they don't care about the number.
Returning to this discussion, it does seem unlikely that they care
about the number and _don't_ care if it actually gets realized.

But then, it still seems like we can just keep the option as part of
Grouped. As in:

// user does not care
stream.groupByKey(Grouped /*not specifying partition count*/)
stream.groupBy(Grouped /*not specifying partition count*/)

// user does care
stream.repartition(Repartitioned)
stream.groupByKey(Grouped.numberOfPartitions(...))
stream.groupBy(Grouped.numberOfPartitions(...))

----

The above discussion got me thinking about algebra. Matthias is
absolutely right that `groupByKey(numPartitions)` is equivalent to
`repartition(numPartitions).groupByKey()`. I'm just not convinced that
we should force people to apply that expansion themselves vs. having a
more compact way to express it if they don't care where exactly the
repartition occurs. However, thinking about these operators
algebraically can really help *us* narrow down the number of different
expressions we have to consider.

Let's consider some identities:

A: groupBy(mapper) + agg = mapKey(mapper) + groupByKey + agg
B: src + ... + groupByKey + agg = src + ... + passthough + agg
C: mapKey(mapper) + ... + groupByKey + agg
 = mapKey(mapper) + ... + repartition + groupByKey + agg
D: repartition = sink(managed) + src

In these identities, I used one special identifier (...), which means
any number (0+) of operations that are not src, mapKey, groupBy[Key],
repartition, or agg.

For mental clarity, I'm just going to make up a rule that groupBy
operations are not executable. In other words, you have to get to a
point where you can apply B to convert a groupByKey into a passthough
in order to execute the program. This is just a formal way of stating
what already happens in Kafka Streams.

By applying A, we can just completely leave `groupBy` out of our
analysis. It trivially decomposes into a mapKey followed by a
groupByKey.

Then, we can eliminate the "repartition required" case of `groupByKey`
by applying C followed by D to get to the "no repartition required"
version of groupByKey, which in turn sets us up to apply B to get an
executable topology.

Fundamentally, you can think about KIP-221 is as proposing a modified
D identity in which you can specify the partition count of the managed
sink topic:
D': repartition(pc) = sink(managed w/ pc) + src

Since users _could_ apply the identities above, we don't actually have
to add any partition count to groupBy[Key], but we decided early on in
the KIP discussion that it's more ergonomic to add it. In that case,
we also have to modify A and C:
A': groupBy(mapper, pc) + agg
 = mapKey(mapper) + groupByKey(pc) + agg
C': mapKey(mapper) + ... + groupByKey(pc) + agg
 = mapKey(mapper) + ... + repartition(pc) + groupByKey + agg

Which sets us up still to always be able to get back to a plain
`groupByKey` operation (with no `(pc)`) and then apply D' and
ultimately B to get an executable topology.

What about the optimizer?
The optimizer applies another set of graph-algebraic identities to
minimize the number of repartition topics in a topology.

(forgive my ascii art)

E: (merging repartition nodes)
(...) -> repartition -> X
  \-> repartition -> Y
=
(... + repartition) -> X
     \-> Y
F: (reordering around repartition)
Where SVO is any non-key-changing, stateless, operation:
repartition -> SVO = SVO -> repartition

In terms of these identities, what the optimizer does is apply F
repeatedly in either direction to a topology to factor out common in
branches so that it can apply E to merge repartition nodes. This was
especially necessary before KIP-221 because you couldn't directly
express `repartition` in the DSL, only indirectly via `groupBy[Key]`,
so there was no way to do the factoring manually.

We can now state very clearly that in KIP-221, explicit
`repartition()` operators should create a "reordering barrier". So, F
cannot be applied to an explicit `repartition()`. Also, I think we
decided earlier that explicit `repartition()` operations would also be
ineligible for merging, so E can't be applied to explicit
`repartition()` operations either. I think we feel we _could_ apply E
without harm, but we want to be conservative for now.

I think the salient point from the latter discussion has been that
when you use `Grouped.numberOfPartitions`, this does _not_ constitute
an explicit `repartition()` operator, and therefore, the resulting
repartition node remains eligible for optimization.

To be clear, I agree with Matthias that the provided partition count
_must_ be used in the resulting implicit repartition. This has some
implications for E. Namely, E could only be applied to two repartition
nodes that have the same partition count. This has always been
trivially true before KIP-221 because the partition count has always
been "unspecified", i.e., it would be determined at runtime by the
user-managed-topics' partition counts. Now, it could be specified or
unspecified. We can simply augment E to allow merging only repartition
nodes where the partition count is EITHER "specified and the same on
both sides", OR "unspecified on both sides".

Sorry for the long email, but I have a hope that it builds a solid
theoretical foundation for our decisions in KIP-221, so we can have
confidence that there are no edge cases for design flaws to hide.

Thanks,
-John

On Sat, Nov 9, 2019 at 10:37 PM Matthias J. Sax <ma...@confluent.io> wrote:
>
> > it seems like we do want to allow
> >> people to optionally specify a partition count as part of this
> >> operation, but we don't want that option to _force_ repartitioning
>
> Correct, ie, that is my suggestions.
>
> > "Use P partitions if repartitioning is necessary"
>
> I disagree here, because my reasoning is that:
>
>  - if a user cares about the number of partition, the user wants to
> enforce a repartitioning
>  - if a user does not case about the number of partitions, we don't need
> to provide them a way to pass in a "hint"
>
> Hence, it should be sufficient to support:
>
> // user does not care
>
>   `stream.groupByKey(Grouped)`
>   `stream.grouBy(..., Grouped)`
>
> // user does care
>
>   `stream.repartition(Repartitioned).groupByKey()`
>   `streams.groupBy(..., Repartitioned)`
>
>
>
> -Matthias
>
>
> On 11/9/19 8:10 PM, John Roesler wrote:
> > Thanks for those thoughts, Matthias,
> >
> > I find your reasoning about the optimization behavior compelling. The
> > `through` operation is very simple and clear to reason about. It just
> > passes the data exactly at the specified point in the topology exactly
> > through the specified topic. Likewise, if a user invokes a
> > `repartition` operator, the simplest behavior is if we just do what
> > they asked for.
> >
> > Stepping back to think about when optimizations are surprising and
> > when they aren't, it occurs to me that we should be free to move
> > around repartitions when users have asked to perform some operation
> > that implies a repartition, like "change keys, then filter, then
> > aggregate". This program requires a repartition, but it could be
> > anywhere between the key change and the aggregation. On the other
> > hand, if they say, "change keys, then filter, then repartition, then
> > aggregate", it seems like they were pretty clear about their desire,
> > and we should just take it at face value.
> >
> > So, I'm sold on just literally doing a repartition every time they
> > invoke the `repartition` operator.
> >
> >
> > The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
> >
> > What you said about `groupByKey` makes sense. If they key hasn't
> > actually changed, then we don't need to repartition before
> > aggregating. On the other hand, `groupBy` is specifically changing the
> > key as part of the grouping operation, so (as you said) we definitely
> > have to do a repartition.
> >
> > If I'm reading the discussion right, it seems like we do want to allow
> > people to optionally specify a partition count as part of this
> > operation, but we don't want that option to _force_ repartitioning if
> > it's not needed. That last clause is the key. "Use P partitions if
> > repartitioning is necessary" is a directive that applies cleanly and
> > correctly to both `groupBy` and `groupByKey`. What if we call the
> > option `numberOfPartitionsHint`, which along with the "if necessary"
> > javadoc, should make it clear that the option won't force a
> > repartition, and also gives us enough latitude to still employ the
> > optimizer on those repartition topics?
> >
> > If we like the idea of expressing it as a "hint" for grouping and a
> > "command" for `repartition`, then it seems like it still makes sense
> > to keep Grouped and Repartitioned separate, as they would actually
> > offer different methods with distinct semantics.
> >
> > WDYT?
> >
> > Thanks,
> > -John
> >
> > On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
> >>
> >> Sorry for late reply.
> >>
> >> I guess, the question boils down to the intended semantics of
> >> `repartition()`. My understanding is as follows:
> >>
> >> - KS does auto-repartitioning for correctness reasons (using the
> >> upstream topic to determine the number of partitions)
> >> - KS does auto-repartitioning only for downstream DSL operators like
> >> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
> >> even if the stream is marked as `repartitioningRequired`).
> >> - KS offers `through()` to enforce a repartitioning -- however, the user
> >> needs to create the topic manually (with the desired number of partitions).
> >>
> >> I see two main applications for `repartitioning()`:
> >>
> >> 1) repartition data before a `transform()` but user does not want to
> >> manage the topic
> >> 2) scale out a downstream subtopology
> >>
> >> Hence, I see `repartition()` similar to `through()`: if a users calls
> >> it, a repartitining is enforced, with the difference that KS manages the
> >> topic and the user does not need to create it.
> >>
> >> This behavior makes (1) and (2) possible.
> >>
> >>> I think many users would prefer to just say "if there *is* a repartition
> >>> required at this point in the topology, it should
> >>> have N partitions"
> >>
> >> Because of (2), I disagree. Either a user does not care about scaling
> >> out, for which case she would not specify the number of partitions. Or a
> >> user does care, and hence wants to enforce the scale out. I don't think
> >> that any user would say, "maybe scale out".
> >>
> >> Therefore, the optimizer should never ignore the repartition operation.
> >> As a "consequence" (because repartitioning is expensive) a user should
> >> make an explicit call to `repartition()` IMHO -- piggybacking an
> >> enforced repartitioning into `groupByKey()` seems to be "dangerous"
> >> because it might be too subtle and an "optional scaling out" as laid out
> >> above does not make sense IMHO.
> >>
> >> I am also not worried about "over repartitioning" because the result
> >> stream would never trigger auto-repartitioning. Only if multiple
> >> consecutive calls to `repartition()` are made it could be bad -- but
> >> that's the same with `through()`. In the end, there is always some
> >> responsibility on the user.
> >>
> >> Btw, for `.groupBy()` we know that repartitioning will be required,
> >> however, for `groupByKey()` it depends if the KStream is marked as
> >> `repartitioningRequired`.
> >>
> >> Hence, for `groupByKey()` it should not be possible for a user to set
> >> number of partitions IMHO. For `groupBy()` it's a different story,
> >> because calling
> >>
> >>    `repartition().groupBy()`
> >>
> >> does not achieve what we want. Hence, allowing users to pass in the
> >> number of users partitions into `groupBy()` does actually makes sense,
> >> because repartitioning will happen anyway and thus we can piggyback a
> >> scaling decision.
> >>
> >> I think that John has a fair concern about the overloads, however, I am
> >> not convinced that using `Grouped` to specify the number of partitions
> >> is intuitive. I double checked `Grouped` and `Repartitioned` and both
> >> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
> >> wondering if we could bridge the gap between both, if we would make
> >> `Repartitioned extends Grouped`? For this case, we only need
> >> `groupBy(Grouped)` and a user can pass in both types what seems to make
> >> the API quite smooth:
> >>
> >>   `stream.groupBy(..., Grouped...)`
> >>
> >>   `stream.groupBy(..., Repartitioned...)`
> >>
> >>
> >> Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >>
> >>
> >> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> >>> Hi Sophie,
> >>>
> >>> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
> >>>
> >>> Kind regards,
> >>> Levani
> >>>
> >>>
> >>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
> >>>>
> >>>>> Personally, I think Matthias’s concern is valid, but on the other hand
> >>>> Kafka Streams has already
> >>>>> optimizer in place which alters topology independently from user
> >>>>
> >>>> I agree (with you) and think this is a good way to put it -- we currently
> >>>> auto-repartition for the user so
> >>>> that they don't have to walk through their entire topology and reason about
> >>>> when and where to place a
> >>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
> >>>> user? How certain are we that
> >>>> users will always get this right? It's easy to imagine that during
> >>>> development, you write your new app with
> >>>> correctly placed repartitions in order to use this new feature. During the
> >>>> course of development you end up
> >>>> tweaking the topology, but don't remember to review or move the
> >>>> repartitioning since you're used to Streams
> >>>> doing this for you. If you use only single-partition topics for testing,
> >>>> you might not even notice your app is
> >>>> spitting out incorrect results!
> >>>>
> >>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
> >>>> feature and say that to use it, you can't take
> >>>> advantage of this other feature anymore. Also, is it possible our
> >>>> optimization framework could ever include an
> >>>> optimized repartitioning strategy that is better than what a user could
> >>>> achieve by manually inserting repartitions?
> >>>> Do we expect users to have a deep understanding of the best way to
> >>>> repartition their particular topology, or is it
> >>>> likely they will end up over-repartitioning either due to missed
> >>>> optimizations or unnecessary extra repartitions?
> >>>> I think many users would prefer to just say "if there *is* a repartition
> >>>> required at this point in the topology, it should
> >>>> have N partitions"
> >>>>
> >>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
> >>>> adding a new parameter to groupBy, that does seem more in line with the
> >>>> current syntax so +1 from me
> >>>>
> >>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>> While https://github.com/apache/kafka/pull/7170 <
> >>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
> >>>>> almost done, I want to resurrect discussion about this KIP to address
> >>>>> couple of concerns raised by Matthias and John.
> >>>>>
> >>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
> >>>>> repartitioning and parallelism of sub-topologies by:
> >>>>> 1) Introducing new KStream#repartition operation which is done in
> >>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>> https://github.com/apache/kafka/pull/7170>
> >>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
> >>>>> be separate PR.
> >>>>>
> >>>>> While all agree about general implementation and idea behind
> >>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
> >>>>> KStream#groupBy(Repartitioned) method overload raised some questions during
> >>>>> the review.
> >>>>> Matthias raised concern that there can be cases when user uses
> >>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
> >>>>> not required, thus configuration passed via `Repartitioned` would never be
> >>>>> applied (Matthias, please correct me if I misinterpreted your comment).
> >>>>> So instead, if user wants to control parallelism of sub-topologies, he or
> >>>>> she should always use `KStream#repartition` operation before groupBy. Full
> >>>>> comment can be seen here:
> >>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> >>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> >>>>>
> >>>>> On the same topic, John pointed out that, from API design perspective, we
> >>>>> shouldn’t intertwine configuration classes of different operators between
> >>>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
> >>>>> for specifying number of partitions for internal topic, we should update
> >>>>> existing `Grouped` class with `numberOfPartitions` field.
> >>>>>
> >>>>> Personally, I think Matthias’s concern is valid, but on the other hand
> >>>>> Kafka Streams has already optimizer in place which alters topology
> >>>>> independently from user. So maybe it makes sense if Kafka Streams,
> >>>>> internally would optimize topology in the best way possible, even if in
> >>>>> some cases this means ignoring some operator configurations passed by the
> >>>>> user. Also, I agree with John about API design semantics. If we go through
> >>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
> >>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
> >>>>> new `KStream#groupBy(Repartitioned)` method overload.
> >>>>>
> >>>>> I would really appreciate communities feedback on this.
> >>>>>
> >>>>> Kind regards,
> >>>>> Levani
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hey Levani,
> >>>>>>
> >>>>>> I think people are busy with the upcoming 2.4 release, and don't have
> >>>>> much
> >>>>>> spare time at the
> >>>>>> moment. It's kind of a difficult time to get attention on things, but
> >>>>> feel
> >>>>>> free to pick up something else
> >>>>>> to work on in the meantime until things have calmed down a bit!
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Sophie
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> Sorry for bringing this thread again, but I would like to get some
> >>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
> >>>>> https://github.com/apache/kafka/pull/7170> <
> >>>>>>> https://github.com/apache/kafka/pull/7170 <
> >>>>> https://github.com/apache/kafka/pull/7170>>
> >>>>>>> It's been a while now and I would love to move on to other KIPs as well.
> >>>>>>> Please let me know if you have any concerns.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Levani
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> Here’s voting thread for this KIP:
> >>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
> >>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Levani
> >>>>>>>>
> >>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> >>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Matthias,
> >>>>>>>>>
> >>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
> >>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
> >>>>>>>>>
> >>>>>>>>> Updated KIP
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Levani
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
> >>>>> <ma...@confluent.io>
> >>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> One question:
> >>>>>>>>>>
> >>>>>>>>>> Why do we add
> >>>>>>>>>>
> >>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
> >>>>>>>>>>
> >>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
> >>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
> >>>>> be
> >>>>>>>>>> sufficient. Users can chain the method calls.
> >>>>>>>>>>
> >>>>>>>>>> (I think it's valuable to keep the number of overload small if
> >>>>>>> possible.)
> >>>>>>>>>>
> >>>>>>>>>> Otherwise LGTM.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> -Matthias
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>> Hello,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks all for your feedback.
> >>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
> >>>>>>> concerns about this KIP, please let me know.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> >>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the suggestion, makes sense.
> >>>>>>>>>>>> I’ve updated KIP (
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>>>> ).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Levani
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> >>>>> matthias@confluent.io <ma...@confluent.io>
> >>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
> >>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for driving the KIP.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
> >>>>>>> strategy.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
> >>>>>>> configs. My
> >>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
> >>>>>>> `Produced`,
> >>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
> >>>>>>> helps
> >>>>>>>>>>>>> to keep the API "clean".
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With regard to method names. I would prefer to avoid
> >>>>> abbreviations.
> >>>>>>> Can
> >>>>>>>>>>>>> we rename:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
> >>>>> have
> >>>>>>> some duplicate configurations between operators in order to keep API
> >>>>> simple
> >>>>>>> and usable.
> >>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
> >>>>>>> backward compatibility.
> >>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
> >>>>>>> exciting, having Repartitioned class encapsulate some producer level
> >>>>>>> configs makes API more readable.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> >>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
> >>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
> >>>>>>> separate.
> >>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
> >>>>>>> configuration. But
> >>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
> >>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
> >>>>>>> configured
> >>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
> >>>>>>> writing to the
> >>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
> >>>>>>> sense, these
> >>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
> >>>>>>> true to
> >>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
> >>>>>>> wrapped into a
> >>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
> >>>>>>> split out the
> >>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
> >>>>>>> topic and
> >>>>>>>>>>>>>>> producer level configurations separate, but this increases the
> >>>>>>> API surface
> >>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
> >>>>> is
> >>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
> >>>>>>> about what
> >>>>>>>>>>>>>>> exactly you're specifying.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
> >>>>>>> Repartitioned,
> >>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
> >>>>>>> Streams at
> >>>>>>>>>>>>>>> the moment
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> >>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> >>>>>>> `Repartitioned#partitioner`
> >>>>>>>>>>>>>>>> configuration.
> >>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
> >>>>>>>>>>>>>>>> configuration and keep topic level and producer level
> >>>>>>> configurations (such
> >>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
> >>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
> >>>>>>> couldn’t really
> >>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
> >>>>>>> class and just
> >>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
> >>>>>>> semantics.
> >>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
> >>>>>>> and
> >>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
> >>>>>>> thing to do.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
> >>>>> <ma...@gmail.com>
> >>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
> >>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> >>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Lavani,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
> >>>>>>> the reason
> >>>>>>>>>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
> >>>>>>> partitioning
> >>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
> >>>>>>> key
> >>>>>>>>>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
> >>>>>>> the key
> >>>>>>>>>>>>>>>>> 2. Something other than the key
> >>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
> >>>>> in
> >>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
> >>>>>>> consistency.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
> >>>>>>> with the
> >>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
> >>>>>>> topics, user
> >>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
> >>>>>>> operation.
> >>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> >>>>> john@confluent.io <ma...@confluent.io>
> >>>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
> >>>>> john@confluent.io <ma...@confluent.io> <mailto:
> >>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
> >>>>>>> sure it
> >>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
> >>>>> rationale
> >>>>>>> for
> >>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
> >>>>>>> first issue
> >>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
> >>>>>>> essentially
> >>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
> >>>>>>>>>>>>>>>> KStream#join
> >>>>>>>>>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> >>>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
> >>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
> >>>>> john@confluent.io>
> >>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
> >>>>> the
> >>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
> >>>>>>> get
> >>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
> >>>>>>> right-side
> >>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
> >>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
> >>>>>>> same
> >>>>>>>>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
> >>>>>>> important?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
> >>>>> not
> >>>>>>> sure
> >>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
> >>>>>>> need control
> >>>>>>>>>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
> >>>>>>> how to
> >>>>>>>>>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
> >>>>>>> topics
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
> >>>>>>> specify
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
> >>>>>>> specifying the
> >>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
> >>>>> this.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
> >>>>>>> would be a
> >>>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
> >>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
> >>>>>>> Produced? I'm not
> >>>>>>>>>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
> >>>>> be
> >>>>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
> >>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
> >>>>>>> the topic
> >>>>>>>>>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
> >>>>> is
> >>>>>>> missing
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
> >>>>>>> configuring
> >>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>>>> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
> >>>>> --
> >>>>>>> for the
> >>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
> >>>>>>> will the
> >>>>>>>>>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
> >>>>> for
> >>>>>>> the
> >>>>>>>>>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
> >>>>>>> pre-created? The
> >>>>>>>>>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
> >>>>>>> will
> >>>>>>>>>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> <
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>> <
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
> >>>>> about
> >>>>>>> this KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
> >>>>>>> confluence
> >>>>>>>>>>>>>>>> page
> >>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
> >>>>>>> class for
> >>>>>>>>>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
> >>>>> `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
> >>>>> as
> >>>>>>> well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
> >>>>>>>>>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
> >>>>>>> summarizing
> >>>>>>>>>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
> >>>>>>> already (can't
> >>>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
> >>>>>>> refresh the
> >>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
> >>>>> to
> >>>>>>> the
> >>>>>>>>>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
> >>>>> operators
> >>>>>>> that use
> >>>>>>>>>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
> >>>>> indicate
> >>>>>>> that
> >>>>>>>>>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
> >>>>>>> use-defined
> >>>>>>>>>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
> >>>>>>> operator that
> >>>>>>>>>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
> >>>>>>> one
> >>>>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
> >>>>>>> sometimes
> >>>>>>>>>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
> >>>>>>> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
> >>>>>>> wondering
> >>>>>>>>>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
> >>>>>>> it is
> >>>>>>>>>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
> >>>>>>> others on
> >>>>>>>>>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
> >>>>> to
> >>>>>>>>>>>>>>>> configure
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
> >>>>>>> assume that
> >>>>>>>>>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
> >>>>>>> instead of
> >>>>>>>>>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
> >>>>> configs,
> >>>>>>> like
> >>>>>>>>>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
> >>>>>>> need to
> >>>>>>>>>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
> >>>>>>> them
> >>>>>>>>>>>>>>>> myself.
> >>>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
> >>>>> enhancing
> >>>>>>> Produced
> >>>>>>>>>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
> >>>>>>> Let me
> >>>>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
> >>>>>>> configuration,
> >>>>>>>>>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
> >>>>>>> the final
> >>>>>>>>>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
> >>>>> assumption
> >>>>>>> that user
> >>>>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
> >>>>>>> case, having
> >>>>>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
> >>>>>>> contract, seems
> >>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
> >>>>>>> explicitly for
> >>>>>>>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
> >>>>>>> are producer
> >>>>>>>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
> >>>>>>>>>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> >>>>>>> configurations
> >>>>>>>>>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
> >>>>>>> Produced
> >>>>>>>>>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
> >>>>> topics
> >>>>>>> created
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
> >>>>>>> should leave
> >>>>>>>>>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
> >>>>>>> should
> >>>>>>>>>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
> >>>>>>> create and
> >>>>>>>>>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
> >>>>>>> should be
> >>>>>>>>>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
> >>>>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
> >>>>>>> configuration in
> >>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
> >>>>>>> KStream#through - I
> >>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
> >>>>>>> possibility to
> >>>>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
> >>>>> Considering
> >>>>>>> arguments
> >>>>>>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
> >>>>>>> options
> >>>>>>>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >>>>>>> num of
> >>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> >>>>> will
> >>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
> >>>>>>> new method
> >>>>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
> >>>>>>> in one of
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
> >>>>>>> following
> >>>>>>>>>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> >>>>> Grouped
> >>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> >>>>>>> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >>>>>>> num of
> >>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> >>>>> will
> >>>>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >>>>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >>>>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> >>>>> Grouped
> >>>>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
> >>>>>>> creating and
> >>>>>>>>>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
> >>>>>>> discussed in the
> >>>>>>>>>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
> >>>>>>> about this
> >>>>>>>>>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>>>> levani.codes@gmail.com>>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
> >>>>>>> KIP-221. Going
> >>>>>>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
> >>>>>>> around
> >>>>>>>>>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
> >>>>>>> understood, the
> >>>>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
> >>>>>>> method
> >>>>>>>>>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
> >>>>>>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
> >>>>> control
> >>>>>>>>>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
> >>>>> operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
> >>>>>>>>>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
> it seems like we do want to allow
>> people to optionally specify a partition count as part of this
>> operation, but we don't want that option to _force_ repartitioning

Correct, ie, that is my suggestions.

> "Use P partitions if repartitioning is necessary"

I disagree here, because my reasoning is that:

 - if a user cares about the number of partition, the user wants to
enforce a repartitioning
 - if a user does not case about the number of partitions, we don't need
to provide them a way to pass in a "hint"

Hence, it should be sufficient to support:

// user does not care

  `stream.groupByKey(Grouped)`
  `stream.grouBy(..., Grouped)`

// user does care

  `stream.repartition(Repartitioned).groupByKey()`
  `streams.groupBy(..., Repartitioned)`



-Matthias


On 11/9/19 8:10 PM, John Roesler wrote:
> Thanks for those thoughts, Matthias,
> 
> I find your reasoning about the optimization behavior compelling. The
> `through` operation is very simple and clear to reason about. It just
> passes the data exactly at the specified point in the topology exactly
> through the specified topic. Likewise, if a user invokes a
> `repartition` operator, the simplest behavior is if we just do what
> they asked for.
> 
> Stepping back to think about when optimizations are surprising and
> when they aren't, it occurs to me that we should be free to move
> around repartitions when users have asked to perform some operation
> that implies a repartition, like "change keys, then filter, then
> aggregate". This program requires a repartition, but it could be
> anywhere between the key change and the aggregation. On the other
> hand, if they say, "change keys, then filter, then repartition, then
> aggregate", it seems like they were pretty clear about their desire,
> and we should just take it at face value.
> 
> So, I'm sold on just literally doing a repartition every time they
> invoke the `repartition` operator.
> 
> 
> The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.
> 
> What you said about `groupByKey` makes sense. If they key hasn't
> actually changed, then we don't need to repartition before
> aggregating. On the other hand, `groupBy` is specifically changing the
> key as part of the grouping operation, so (as you said) we definitely
> have to do a repartition.
> 
> If I'm reading the discussion right, it seems like we do want to allow
> people to optionally specify a partition count as part of this
> operation, but we don't want that option to _force_ repartitioning if
> it's not needed. That last clause is the key. "Use P partitions if
> repartitioning is necessary" is a directive that applies cleanly and
> correctly to both `groupBy` and `groupByKey`. What if we call the
> option `numberOfPartitionsHint`, which along with the "if necessary"
> javadoc, should make it clear that the option won't force a
> repartition, and also gives us enough latitude to still employ the
> optimizer on those repartition topics?
> 
> If we like the idea of expressing it as a "hint" for grouping and a
> "command" for `repartition`, then it seems like it still makes sense
> to keep Grouped and Repartitioned separate, as they would actually
> offer different methods with distinct semantics.
> 
> WDYT?
> 
> Thanks,
> -John
> 
> On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
>>
>> Sorry for late reply.
>>
>> I guess, the question boils down to the intended semantics of
>> `repartition()`. My understanding is as follows:
>>
>> - KS does auto-repartitioning for correctness reasons (using the
>> upstream topic to determine the number of partitions)
>> - KS does auto-repartitioning only for downstream DSL operators like
>> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
>> even if the stream is marked as `repartitioningRequired`).
>> - KS offers `through()` to enforce a repartitioning -- however, the user
>> needs to create the topic manually (with the desired number of partitions).
>>
>> I see two main applications for `repartitioning()`:
>>
>> 1) repartition data before a `transform()` but user does not want to
>> manage the topic
>> 2) scale out a downstream subtopology
>>
>> Hence, I see `repartition()` similar to `through()`: if a users calls
>> it, a repartitining is enforced, with the difference that KS manages the
>> topic and the user does not need to create it.
>>
>> This behavior makes (1) and (2) possible.
>>
>>> I think many users would prefer to just say "if there *is* a repartition
>>> required at this point in the topology, it should
>>> have N partitions"
>>
>> Because of (2), I disagree. Either a user does not care about scaling
>> out, for which case she would not specify the number of partitions. Or a
>> user does care, and hence wants to enforce the scale out. I don't think
>> that any user would say, "maybe scale out".
>>
>> Therefore, the optimizer should never ignore the repartition operation.
>> As a "consequence" (because repartitioning is expensive) a user should
>> make an explicit call to `repartition()` IMHO -- piggybacking an
>> enforced repartitioning into `groupByKey()` seems to be "dangerous"
>> because it might be too subtle and an "optional scaling out" as laid out
>> above does not make sense IMHO.
>>
>> I am also not worried about "over repartitioning" because the result
>> stream would never trigger auto-repartitioning. Only if multiple
>> consecutive calls to `repartition()` are made it could be bad -- but
>> that's the same with `through()`. In the end, there is always some
>> responsibility on the user.
>>
>> Btw, for `.groupBy()` we know that repartitioning will be required,
>> however, for `groupByKey()` it depends if the KStream is marked as
>> `repartitioningRequired`.
>>
>> Hence, for `groupByKey()` it should not be possible for a user to set
>> number of partitions IMHO. For `groupBy()` it's a different story,
>> because calling
>>
>>    `repartition().groupBy()`
>>
>> does not achieve what we want. Hence, allowing users to pass in the
>> number of users partitions into `groupBy()` does actually makes sense,
>> because repartitioning will happen anyway and thus we can piggyback a
>> scaling decision.
>>
>> I think that John has a fair concern about the overloads, however, I am
>> not convinced that using `Grouped` to specify the number of partitions
>> is intuitive. I double checked `Grouped` and `Repartitioned` and both
>> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
>> wondering if we could bridge the gap between both, if we would make
>> `Repartitioned extends Grouped`? For this case, we only need
>> `groupBy(Grouped)` and a user can pass in both types what seems to make
>> the API quite smooth:
>>
>>   `stream.groupBy(..., Grouped...)`
>>
>>   `stream.groupBy(..., Repartitioned...)`
>>
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>>
>>
>> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
>>> Hi Sophie,
>>>
>>> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
>>>
>>> Kind regards,
>>> Levani
>>>
>>>
>>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
>>>>
>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>> Kafka Streams has already
>>>>> optimizer in place which alters topology independently from user
>>>>
>>>> I agree (with you) and think this is a good way to put it -- we currently
>>>> auto-repartition for the user so
>>>> that they don't have to walk through their entire topology and reason about
>>>> when and where to place a
>>>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>>>> user? How certain are we that
>>>> users will always get this right? It's easy to imagine that during
>>>> development, you write your new app with
>>>> correctly placed repartitions in order to use this new feature. During the
>>>> course of development you end up
>>>> tweaking the topology, but don't remember to review or move the
>>>> repartitioning since you're used to Streams
>>>> doing this for you. If you use only single-partition topics for testing,
>>>> you might not even notice your app is
>>>> spitting out incorrect results!
>>>>
>>>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>>>> feature and say that to use it, you can't take
>>>> advantage of this other feature anymore. Also, is it possible our
>>>> optimization framework could ever include an
>>>> optimized repartitioning strategy that is better than what a user could
>>>> achieve by manually inserting repartitions?
>>>> Do we expect users to have a deep understanding of the best way to
>>>> repartition their particular topology, or is it
>>>> likely they will end up over-repartitioning either due to missed
>>>> optimizations or unnecessary extra repartitions?
>>>> I think many users would prefer to just say "if there *is* a repartition
>>>> required at this point in the topology, it should
>>>> have N partitions"
>>>>
>>>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>>>> adding a new parameter to groupBy, that does seem more in line with the
>>>> current syntax so +1 from me
>>>>
>>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> While https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>>>> almost done, I want to resurrect discussion about this KIP to address
>>>>> couple of concerns raised by Matthias and John.
>>>>>
>>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>>>> repartitioning and parallelism of sub-topologies by:
>>>>> 1) Introducing new KStream#repartition operation which is done in
>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170>
>>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>>>> be separate PR.
>>>>>
>>>>> While all agree about general implementation and idea behind
>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>>>> KStream#groupBy(Repartitioned) method overload raised some questions during
>>>>> the review.
>>>>> Matthias raised concern that there can be cases when user uses
>>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>>>> not required, thus configuration passed via `Repartitioned` would never be
>>>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>>>> she should always use `KStream#repartition` operation before groupBy. Full
>>>>> comment can be seen here:
>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>>>
>>>>> On the same topic, John pointed out that, from API design perspective, we
>>>>> shouldn’t intertwine configuration classes of different operators between
>>>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>>>>> for specifying number of partitions for internal topic, we should update
>>>>> existing `Grouped` class with `numberOfPartitions` field.
>>>>>
>>>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>>>> Kafka Streams has already optimizer in place which alters topology
>>>>> independently from user. So maybe it makes sense if Kafka Streams,
>>>>> internally would optimize topology in the best way possible, even if in
>>>>> some cases this means ignoring some operator configurations passed by the
>>>>> user. Also, I agree with John about API design semantics. If we go through
>>>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>>>
>>>>> I would really appreciate communities feedback on this.
>>>>>
>>>>> Kind regards,
>>>>> Levani
>>>>>
>>>>>
>>>>>
>>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
>>>>> wrote:
>>>>>>
>>>>>> Hey Levani,
>>>>>>
>>>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>>>> much
>>>>>> spare time at the
>>>>>> moment. It's kind of a difficult time to get attention on things, but
>>>>> feel
>>>>>> free to pick up something else
>>>>>> to work on in the meantime until things have calmed down a bit!
>>>>>>
>>>>>> Cheers,
>>>>>> Sophie
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170> <
>>>>>>> https://github.com/apache/kafka/pull/7170 <
>>>>> https://github.com/apache/kafka/pull/7170>>
>>>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>>>> Please let me know if you have any concerns.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>>
>>>>>>>
>>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Here’s voting thread for this KIP:
>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Levani
>>>>>>>>
>>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Matthias,
>>>>>>>>>
>>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>>>
>>>>>>>>> Updated KIP
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
>>>>> <ma...@confluent.io>
>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>
>>>>>>>>>> One question:
>>>>>>>>>>
>>>>>>>>>> Why do we add
>>>>>>>>>>
>>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>>>
>>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>>>> be
>>>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>>>
>>>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>>>> possible.)
>>>>>>>>>>
>>>>>>>>>> Otherwise LGTM.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Thanks all for your feedback.
>>>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>>>> concerns about this KIP, please let me know.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>>>> I’ve updated KIP (
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>>>> ).
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>>>> matthias@confluent.io <ma...@confluent.io>
>>>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
>>>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>>>> strategy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>>>> configs. My
>>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>>>> `Produced`,
>>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>>>> helps
>>>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>>>> abbreviations.
>>>>>>> Can
>>>>>>>>>>>>> we rename:
>>>>>>>>>>>>>
>>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>>>
>>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>>>> have
>>>>>>> some duplicate configurations between operators in order to keep API
>>>>> simple
>>>>>>> and usable.
>>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>>>> backward compatibility.
>>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>>>> configs makes API more readable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>>>> separate.
>>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>>>> configuration. But
>>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>>>> configured
>>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>>>> writing to the
>>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>>>> sense, these
>>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>>>> true to
>>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>>>> wrapped into a
>>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>>>> split out the
>>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>>>> topic and
>>>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>>>> API surface
>>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>>>> is
>>>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>>>> about what
>>>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>>>> Repartitioned,
>>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>>>> Streams at
>>>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>>>> configurations (such
>>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>>>> couldn’t really
>>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>>>> class and just
>>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>>>> semantics.
>>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>>>> and
>>>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>>>> thing to do.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
>>>>> <ma...@gmail.com>
>>>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
>>>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>>>> the reason
>>>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>>>> partitioning
>>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>>>> key
>>>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>>>> the key
>>>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>>>> in
>>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>>>> consistency.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>>>> with the
>>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>>>> topics, user
>>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>>>> operation.
>>>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>>>> john@confluent.io <ma...@confluent.io>
>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
>>>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>>>> sure it
>>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>>>> rationale
>>>>>>> for
>>>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>>>> first issue
>>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>>>> essentially
>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
>>>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>>>> john@confluent.io>
>>>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>>>> the
>>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>>>> get
>>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>>>> right-side
>>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>>>> same
>>>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>>>> important?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>>>> not
>>>>>>> sure
>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>>>> need control
>>>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>>>> how to
>>>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>>>> topics
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>>>> specify
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>>>> specifying the
>>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>>>> would be a
>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>>>> Produced? I'm not
>>>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>>>> be
>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>>>> the topic
>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>>>> is
>>>>>>> missing
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>>>> configuring
>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>>>> --
>>>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>>>> will the
>>>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>>>> for
>>>>>>> the
>>>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>>>> pre-created? The
>>>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>>>> will
>>>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> <
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>> <
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>>>> about
>>>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>>>> confluence
>>>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>>>> class for
>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>>>> as
>>>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>>>> already (can't
>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>>>> to
>>>>>>> the
>>>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>>>> operators
>>>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>>>> indicate
>>>>>>> that
>>>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>>>> use-defined
>>>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>>>> operator that
>>>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>>>> one
>>>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>>>> sometimes
>>>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>>>> wondering
>>>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>>>> it is
>>>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>>>> others on
>>>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>>>> to
>>>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>>>> assume that
>>>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>>>> instead of
>>>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>>>> configs,
>>>>>>> like
>>>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>>>> need to
>>>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>>>> them
>>>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>>>> enhancing
>>>>>>> Produced
>>>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>>>> Let me
>>>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>>>> configuration,
>>>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>>>> the final
>>>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>>>> assumption
>>>>>>> that user
>>>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>>>> case, having
>>>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>>>> contract, seems
>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>>>> explicitly for
>>>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>>>> configurations
>>>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>>>> Produced
>>>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>>>> topics
>>>>>>> created
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>>>> should leave
>>>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>>>> should
>>>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>>>> create and
>>>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>>>> should be
>>>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>>>> configuration in
>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>>>> KStream#through - I
>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>>>> possibility to
>>>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>>>> Considering
>>>>>>> arguments
>>>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>>>> options
>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>> num of
>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>> will
>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>>>> new method
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>>>> in one of
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>>>> following
>>>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>> Grouped
>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>>>> num of
>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>>>> will
>>>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>>>> Grouped
>>>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>>>> creating and
>>>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>>>> discussed in the
>>>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>>>> about this
>>>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>>>> levani.codes@gmail.com>>
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>>>> KIP-221. Going
>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>>>> around
>>>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>>>> understood, the
>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>>>> method
>>>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>>>> control
>>>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>
>>>>>
>>>
>>>
>>


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by John Roesler <jo...@confluent.io>.
Thanks for those thoughts, Matthias,

I find your reasoning about the optimization behavior compelling. The
`through` operation is very simple and clear to reason about. It just
passes the data exactly at the specified point in the topology exactly
through the specified topic. Likewise, if a user invokes a
`repartition` operator, the simplest behavior is if we just do what
they asked for.

Stepping back to think about when optimizations are surprising and
when they aren't, it occurs to me that we should be free to move
around repartitions when users have asked to perform some operation
that implies a repartition, like "change keys, then filter, then
aggregate". This program requires a repartition, but it could be
anywhere between the key change and the aggregation. On the other
hand, if they say, "change keys, then filter, then repartition, then
aggregate", it seems like they were pretty clear about their desire,
and we should just take it at face value.

So, I'm sold on just literally doing a repartition every time they
invoke the `repartition` operator.


The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced.

What you said about `groupByKey` makes sense. If they key hasn't
actually changed, then we don't need to repartition before
aggregating. On the other hand, `groupBy` is specifically changing the
key as part of the grouping operation, so (as you said) we definitely
have to do a repartition.

If I'm reading the discussion right, it seems like we do want to allow
people to optionally specify a partition count as part of this
operation, but we don't want that option to _force_ repartitioning if
it's not needed. That last clause is the key. "Use P partitions if
repartitioning is necessary" is a directive that applies cleanly and
correctly to both `groupBy` and `groupByKey`. What if we call the
option `numberOfPartitionsHint`, which along with the "if necessary"
javadoc, should make it clear that the option won't force a
repartition, and also gives us enough latitude to still employ the
optimizer on those repartition topics?

If we like the idea of expressing it as a "hint" for grouping and a
"command" for `repartition`, then it seems like it still makes sense
to keep Grouped and Repartitioned separate, as they would actually
offer different methods with distinct semantics.

WDYT?

Thanks,
-John

On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <ma...@confluent.io> wrote:
>
> Sorry for late reply.
>
> I guess, the question boils down to the intended semantics of
> `repartition()`. My understanding is as follows:
>
> - KS does auto-repartitioning for correctness reasons (using the
> upstream topic to determine the number of partitions)
> - KS does auto-repartitioning only for downstream DSL operators like
> `count()` (eg, a `transform()` does never trigger an auto-repartitioning
> even if the stream is marked as `repartitioningRequired`).
> - KS offers `through()` to enforce a repartitioning -- however, the user
> needs to create the topic manually (with the desired number of partitions).
>
> I see two main applications for `repartitioning()`:
>
> 1) repartition data before a `transform()` but user does not want to
> manage the topic
> 2) scale out a downstream subtopology
>
> Hence, I see `repartition()` similar to `through()`: if a users calls
> it, a repartitining is enforced, with the difference that KS manages the
> topic and the user does not need to create it.
>
> This behavior makes (1) and (2) possible.
>
> > I think many users would prefer to just say "if there *is* a repartition
> > required at this point in the topology, it should
> > have N partitions"
>
> Because of (2), I disagree. Either a user does not care about scaling
> out, for which case she would not specify the number of partitions. Or a
> user does care, and hence wants to enforce the scale out. I don't think
> that any user would say, "maybe scale out".
>
> Therefore, the optimizer should never ignore the repartition operation.
> As a "consequence" (because repartitioning is expensive) a user should
> make an explicit call to `repartition()` IMHO -- piggybacking an
> enforced repartitioning into `groupByKey()` seems to be "dangerous"
> because it might be too subtle and an "optional scaling out" as laid out
> above does not make sense IMHO.
>
> I am also not worried about "over repartitioning" because the result
> stream would never trigger auto-repartitioning. Only if multiple
> consecutive calls to `repartition()` are made it could be bad -- but
> that's the same with `through()`. In the end, there is always some
> responsibility on the user.
>
> Btw, for `.groupBy()` we know that repartitioning will be required,
> however, for `groupByKey()` it depends if the KStream is marked as
> `repartitioningRequired`.
>
> Hence, for `groupByKey()` it should not be possible for a user to set
> number of partitions IMHO. For `groupBy()` it's a different story,
> because calling
>
>    `repartition().groupBy()`
>
> does not achieve what we want. Hence, allowing users to pass in the
> number of users partitions into `groupBy()` does actually makes sense,
> because repartitioning will happen anyway and thus we can piggyback a
> scaling decision.
>
> I think that John has a fair concern about the overloads, however, I am
> not convinced that using `Grouped` to specify the number of partitions
> is intuitive. I double checked `Grouped` and `Repartitioned` and both
> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
> wondering if we could bridge the gap between both, if we would make
> `Repartitioned extends Grouped`? For this case, we only need
> `groupBy(Grouped)` and a user can pass in both types what seems to make
> the API quite smooth:
>
>   `stream.groupBy(..., Grouped...)`
>
>   `stream.groupBy(..., Repartitioned...)`
>
>
> Thoughts?
>
>
> -Matthias
>
>
>
> On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> > Hi Sophie,
> >
> > Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
> >
> > Kind regards,
> > Levani
> >
> >
> >> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
> >>
> >>> Personally, I think Matthias’s concern is valid, but on the other hand
> >> Kafka Streams has already
> >>> optimizer in place which alters topology independently from user
> >>
> >> I agree (with you) and think this is a good way to put it -- we currently
> >> auto-repartition for the user so
> >> that they don't have to walk through their entire topology and reason about
> >> when and where to place a
> >> `.through` (or the new `.repartition`), so why suddenly force this onto the
> >> user? How certain are we that
> >> users will always get this right? It's easy to imagine that during
> >> development, you write your new app with
> >> correctly placed repartitions in order to use this new feature. During the
> >> course of development you end up
> >> tweaking the topology, but don't remember to review or move the
> >> repartitioning since you're used to Streams
> >> doing this for you. If you use only single-partition topics for testing,
> >> you might not even notice your app is
> >> spitting out incorrect results!
> >>
> >> Anyways, I feel pretty strongly that it would be weird to introduce a new
> >> feature and say that to use it, you can't take
> >> advantage of this other feature anymore. Also, is it possible our
> >> optimization framework could ever include an
> >> optimized repartitioning strategy that is better than what a user could
> >> achieve by manually inserting repartitions?
> >> Do we expect users to have a deep understanding of the best way to
> >> repartition their particular topology, or is it
> >> likely they will end up over-repartitioning either due to missed
> >> optimizations or unnecessary extra repartitions?
> >> I think many users would prefer to just say "if there *is* a repartition
> >> required at this point in the topology, it should
> >> have N partitions"
> >>
> >> As to the idea of adding `numberOfPartitions` to Grouped rather than
> >> adding a new parameter to groupBy, that does seem more in line with the
> >> current syntax so +1 from me
> >>
> >> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
> >> wrote:
> >>
> >>> Hello all,
> >>>
> >>> While https://github.com/apache/kafka/pull/7170 <
> >>> https://github.com/apache/kafka/pull/7170> is under review and it’s
> >>> almost done, I want to resurrect discussion about this KIP to address
> >>> couple of concerns raised by Matthias and John.
> >>>
> >>> As a reminder, idea of the KIP-221 was to allow DSL users control over
> >>> repartitioning and parallelism of sub-topologies by:
> >>> 1) Introducing new KStream#repartition operation which is done in
> >>> https://github.com/apache/kafka/pull/7170 <
> >>> https://github.com/apache/kafka/pull/7170>
> >>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
> >>> be separate PR.
> >>>
> >>> While all agree about general implementation and idea behind
> >>> https://github.com/apache/kafka/pull/7170 <
> >>> https://github.com/apache/kafka/pull/7170> PR, introducing new
> >>> KStream#groupBy(Repartitioned) method overload raised some questions during
> >>> the review.
> >>> Matthias raised concern that there can be cases when user uses
> >>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
> >>> not required, thus configuration passed via `Repartitioned` would never be
> >>> applied (Matthias, please correct me if I misinterpreted your comment).
> >>> So instead, if user wants to control parallelism of sub-topologies, he or
> >>> she should always use `KStream#repartition` operation before groupBy. Full
> >>> comment can be seen here:
> >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> >>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
> >>>
> >>> On the same topic, John pointed out that, from API design perspective, we
> >>> shouldn’t intertwine configuration classes of different operators between
> >>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
> >>> for specifying number of partitions for internal topic, we should update
> >>> existing `Grouped` class with `numberOfPartitions` field.
> >>>
> >>> Personally, I think Matthias’s concern is valid, but on the other hand
> >>> Kafka Streams has already optimizer in place which alters topology
> >>> independently from user. So maybe it makes sense if Kafka Streams,
> >>> internally would optimize topology in the best way possible, even if in
> >>> some cases this means ignoring some operator configurations passed by the
> >>> user. Also, I agree with John about API design semantics. If we go through
> >>> with the changes for `KStream#groupBy` operation, it makes more sense to
> >>> add `numberOfPartitions` field to `Grouped` class instead of introducing
> >>> new `KStream#groupBy(Repartitioned)` method overload.
> >>>
> >>> I would really appreciate communities feedback on this.
> >>>
> >>> Kind regards,
> >>> Levani
> >>>
> >>>
> >>>
> >>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
> >>> wrote:
> >>>>
> >>>> Hey Levani,
> >>>>
> >>>> I think people are busy with the upcoming 2.4 release, and don't have
> >>> much
> >>>> spare time at the
> >>>> moment. It's kind of a difficult time to get attention on things, but
> >>> feel
> >>>> free to pick up something else
> >>>> to work on in the meantime until things have calmed down a bit!
> >>>>
> >>>> Cheers,
> >>>> Sophie
> >>>>
> >>>>
> >>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> >>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>> wrote:
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>> Sorry for bringing this thread again, but I would like to get some
> >>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
> >>> https://github.com/apache/kafka/pull/7170> <
> >>>>> https://github.com/apache/kafka/pull/7170 <
> >>> https://github.com/apache/kafka/pull/7170>>
> >>>>> It's been a while now and I would love to move on to other KIPs as well.
> >>>>> Please let me know if you have any concerns.
> >>>>>
> >>>>> Regards,
> >>>>> Levani
> >>>>>
> >>>>>
> >>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> >>> levani.codes@gmail.com <ma...@gmail.com>>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Here’s voting thread for this KIP:
> >>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
> >>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> >>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Levani
> >>>>>>
> >>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> >>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>
> >>>>>>> Hi Matthias,
> >>>>>>>
> >>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
> >>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
> >>>>>>>
> >>>>>>> Updated KIP
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Levani
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
> >>> <ma...@confluent.io>
> >>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>
> >>>>>>>> One question:
> >>>>>>>>
> >>>>>>>> Why do we add
> >>>>>>>>
> >>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
> >>>>>>>>
> >>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
> >>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
> >>> be
> >>>>>>>> sufficient. Users can chain the method calls.
> >>>>>>>>
> >>>>>>>> (I think it's valuable to keep the number of overload small if
> >>>>> possible.)
> >>>>>>>>
> >>>>>>>> Otherwise LGTM.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> Thanks all for your feedback.
> >>>>>>>>> I started voting procedure for this KIP. If there’re any other
> >>>>> concerns about this KIP, please let me know.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Levani
> >>>>>>>>>
> >>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> >>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Matthias,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the suggestion, makes sense.
> >>>>>>>>>> I’ve updated KIP (
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>>>> ).
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Levani
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> >>> matthias@confluent.io <ma...@confluent.io>
> >>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
> >>> matthias@confluent.io <ma...@confluent.io> <mailto:
> >>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for driving the KIP.
> >>>>>>>>>>>
> >>>>>>>>>>> I agree that users need to be able to specify a partitioning
> >>>>> strategy.
> >>>>>>>>>>>
> >>>>>>>>>>> Sophie raises a fair point about topic configs and producer
> >>>>> configs. My
> >>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
> >>>>> `Produced`,
> >>>>>>>>>>> that adds topic configuration, is a good way to think about it and
> >>>>> helps
> >>>>>>>>>>> to keep the API "clean".
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> With regard to method names. I would prefer to avoid
> >>> abbreviations.
> >>>>> Can
> >>>>>>>>>>> we rename:
> >>>>>>>>>>>
> >>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>>>>>>
> >>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
> >>>>>>>>>>>
> >>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -Matthias
> >>>>>>>>>>>
> >>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
> >>> have
> >>>>> some duplicate configurations between operators in order to keep API
> >>> simple
> >>>>> and usable.
> >>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
> >>>>> backward compatibility.
> >>>>>>>>>>>> While initial idea of keeping topic level configs separate was
> >>>>> exciting, having Repartitioned class encapsulate some producer level
> >>>>> configs makes API more readable.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Levani
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> >>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think that is a good point about trying to keep producer level
> >>>>>>>>>>>>> configurations and (repartition) topic level considerations
> >>>>> separate.
> >>>>>>>>>>>>> Number of partitions is definitely purely a topic level
> >>>>> configuration. But
> >>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
> >>>>>>>>>>>>> configuration as a producer one. You could have two producers
> >>>>> configured
> >>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
> >>>>> writing to the
> >>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
> >>>>> sense, these
> >>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
> >>>>> true to
> >>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
> >>>>> wrapped into a
> >>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
> >>>>> split out the
> >>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
> >>>>> topic and
> >>>>>>>>>>>>> producer level configurations separate, but this increases the
> >>>>> API surface
> >>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
> >>> is
> >>>>>>>>>>>>> everything that this particular operator needs" without worrying
> >>>>> about what
> >>>>>>>>>>>>> exactly you're specifying.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
> >>>>> Repartitioned,
> >>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
> >>>>> Streams at
> >>>>>>>>>>>>> the moment
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> >>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> >>>>> `Repartitioned#partitioner`
> >>>>>>>>>>>>>> configuration.
> >>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
> >>>>>>>>>>>>>> configuration and keep topic level and producer level
> >>>>> configurations (such
> >>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
> >>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
> >>>>> couldn’t really
> >>>>>>>>>>>>>> figure out good operation name for Topic level configuration
> >>>>> class and just
> >>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
> >>>>> semantics.
> >>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
> >>>>> and
> >>>>>>>>>>>>>> producer level configurations for internal topics is viable
> >>>>> thing to do.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
> >>> <ma...@gmail.com>
> >>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
> >>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> >>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Lavani,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
> >>>>> the reason
> >>>>>>>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
> >>>>> partitioning
> >>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
> >>>>> key
> >>>>>>>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
> >>>>> the key
> >>>>>>>>>>>>>>> 2. Something other than the key
> >>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
> >>> in
> >>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
> >>>>> consistency.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
> >>>>> with the
> >>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
> >>>>> topics, user
> >>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
> >>>>> operation.
> >>>>>>>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> >>> john@confluent.io <ma...@confluent.io>
> >>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
> >>> john@confluent.io <ma...@confluent.io> <mailto:
> >>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
> >>>>> sure it
> >>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
> >>> rationale
> >>>>> for
> >>>>>>>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
> >>>>> first issue
> >>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
> >>>>> essentially
> >>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
> >>>>>>>>>>>>>> KStream#join
> >>>>>>>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> >>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
> >>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
> >>> john@confluent.io>
> >>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
> >>> the
> >>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
> >>>>> get
> >>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
> >>>>> right-side
> >>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
> >>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
> >>>>> same
> >>>>>>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
> >>>>> important?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> >>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
> >>> not
> >>>>> sure
> >>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
> >>>>> need control
> >>>>>>>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
> >>>>> how to
> >>>>>>>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
> >>>>> topics
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
> >>>>> specify
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
> >>>>> specifying the
> >>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
> >>> this.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
> >>>>> would be a
> >>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
> >>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
> >>>>> Produced? I'm not
> >>>>>>>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
> >>> be
> >>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
> >>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
> >>>>> the topic
> >>>>>>>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
> >>> is
> >>>>> missing
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
> >>>>> configuring
> >>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> >>> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >>>>> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
> >>> --
> >>>>> for the
> >>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
> >>>>> will the
> >>>>>>>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
> >>> for
> >>>>> the
> >>>>>>>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
> >>>>> pre-created? The
> >>>>>>>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
> >>>>> will
> >>>>>>>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> >>> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>> <
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>
> >>>>> <
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
> >>> about
> >>>>> this KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
> >>>>> confluence
> >>>>>>>>>>>>>> page
> >>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
> >>>>> class for
> >>>>>>>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
> >>> `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
> >>> as
> >>>>> well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
> >>>>>>>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
> >>>>> summarizing
> >>>>>>>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
> >>>>> already (can't
> >>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
> >>>>> refresh the
> >>>>>>>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
> >>> to
> >>>>> the
> >>>>>>>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
> >>> operators
> >>>>> that use
> >>>>>>>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
> >>> indicate
> >>>>> that
> >>>>>>>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
> >>>>> use-defined
> >>>>>>>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
> >>>>> operator that
> >>>>>>>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
> >>>>> one
> >>>>>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
> >>>>> sometimes
> >>>>>>>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
> >>>>> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
> >>>>> wondering
> >>>>>>>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
> >>>>> it is
> >>>>>>>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
> >>>>> others on
> >>>>>>>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
> >>> to
> >>>>>>>>>>>>>> configure
> >>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
> >>>>> assume that
> >>>>>>>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
> >>>>> instead of
> >>>>>>>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
> >>> configs,
> >>>>> like
> >>>>>>>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
> >>>>> need to
> >>>>>>>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
> >>>>> them
> >>>>>>>>>>>>>> myself.
> >>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
> >>> enhancing
> >>>>> Produced
> >>>>>>>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
> >>>>> Let me
> >>>>>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
> >>>>> configuration,
> >>>>>>>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
> >>>>> the final
> >>>>>>>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
> >>> assumption
> >>>>> that user
> >>>>>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
> >>>>> case, having
> >>>>>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
> >>>>> contract, seems
> >>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
> >>>>> explicitly for
> >>>>>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
> >>>>> are producer
> >>>>>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
> >>>>>>>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> >>>>> configurations
> >>>>>>>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
> >>>>> Produced
> >>>>>>>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
> >>> topics
> >>>>> created
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
> >>>>> should leave
> >>>>>>>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
> >>>>> should
> >>>>>>>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
> >>>>> create and
> >>>>>>>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
> >>>>> should be
> >>>>>>>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
> >>>>> configuration in
> >>>>>>>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
> >>>>> configuration in
> >>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
> >>>>> KStream#through - I
> >>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
> >>>>> possibility to
> >>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
> >>> Considering
> >>>>> arguments
> >>>>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
> >>>>> options
> >>>>>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >>>>> num of
> >>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> >>> will
> >>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
> >>>>> new method
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
> >>>>> in one of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
> >>>>> following
> >>>>>>>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> >>> Grouped
> >>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> >>>>> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >>>>> num of
> >>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> >>> will
> >>>>> be auto
> >>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >>>>> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >>>>> specified num of
> >>>>>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> >>> Grouped
> >>>>> class (as
> >>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
> >>>>> creating and
> >>>>>>>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
> >>>>> discussed in the
> >>>>>>>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
> >>>>> about this
> >>>>>>>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> >>> levani.codes@gmail.com>>
> >>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
> >>>>> KIP-221. Going
> >>>>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
> >>>>> around
> >>>>>>>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
> >>>>> understood, the
> >>>>>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
> >>>>> method
> >>>>>>>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
> >>>>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
> >>> control
> >>>>>>>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
> >>> operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
> >>>>>>>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>
> >>>
> >
> >
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Sorry for late reply.

I guess, the question boils down to the intended semantics of
`repartition()`. My understanding is as follows:

- KS does auto-repartitioning for correctness reasons (using the
upstream topic to determine the number of partitions)
- KS does auto-repartitioning only for downstream DSL operators like
`count()` (eg, a `transform()` does never trigger an auto-repartitioning
even if the stream is marked as `repartitioningRequired`).
- KS offers `through()` to enforce a repartitioning -- however, the user
needs to create the topic manually (with the desired number of partitions).

I see two main applications for `repartitioning()`:

1) repartition data before a `transform()` but user does not want to
manage the topic
2) scale out a downstream subtopology

Hence, I see `repartition()` similar to `through()`: if a users calls
it, a repartitining is enforced, with the difference that KS manages the
topic and the user does not need to create it.

This behavior makes (1) and (2) possible.

> I think many users would prefer to just say "if there *is* a repartition
> required at this point in the topology, it should
> have N partitions"

Because of (2), I disagree. Either a user does not care about scaling
out, for which case she would not specify the number of partitions. Or a
user does care, and hence wants to enforce the scale out. I don't think
that any user would say, "maybe scale out".

Therefore, the optimizer should never ignore the repartition operation.
As a "consequence" (because repartitioning is expensive) a user should
make an explicit call to `repartition()` IMHO -- piggybacking an
enforced repartitioning into `groupByKey()` seems to be "dangerous"
because it might be too subtle and an "optional scaling out" as laid out
above does not make sense IMHO.

I am also not worried about "over repartitioning" because the result
stream would never trigger auto-repartitioning. Only if multiple
consecutive calls to `repartition()` are made it could be bad -- but
that's the same with `through()`. In the end, there is always some
responsibility on the user.

Btw, for `.groupBy()` we know that repartitioning will be required,
however, for `groupByKey()` it depends if the KStream is marked as
`repartitioningRequired`.

Hence, for `groupByKey()` it should not be possible for a user to set
number of partitions IMHO. For `groupBy()` it's a different story,
because calling

   `repartition().groupBy()`

does not achieve what we want. Hence, allowing users to pass in the
number of users partitions into `groupBy()` does actually makes sense,
because repartitioning will happen anyway and thus we can piggyback a
scaling decision.

I think that John has a fair concern about the overloads, however, I am
not convinced that using `Grouped` to specify the number of partitions
is intuitive. I double checked `Grouped` and `Repartitioned` and both
allow to specify a `name` and `keySerde/valueSerde`. Thus, I am
wondering if we could bridge the gap between both, if we would make
`Repartitioned extends Grouped`? For this case, we only need
`groupBy(Grouped)` and a user can pass in both types what seems to make
the API quite smooth:

  `stream.groupBy(..., Grouped...)`

  `stream.groupBy(..., Repartitioned...)`


Thoughts?


-Matthias



On 11/7/19 10:59 AM, Levani Kokhreidze wrote:
> Hi Sophie,
> 
> Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.
> 
> Kind regards,
> Levani
> 
> 
>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
>>
>>> Personally, I think Matthias’s concern is valid, but on the other hand
>> Kafka Streams has already
>>> optimizer in place which alters topology independently from user
>>
>> I agree (with you) and think this is a good way to put it -- we currently
>> auto-repartition for the user so
>> that they don't have to walk through their entire topology and reason about
>> when and where to place a
>> `.through` (or the new `.repartition`), so why suddenly force this onto the
>> user? How certain are we that
>> users will always get this right? It's easy to imagine that during
>> development, you write your new app with
>> correctly placed repartitions in order to use this new feature. During the
>> course of development you end up
>> tweaking the topology, but don't remember to review or move the
>> repartitioning since you're used to Streams
>> doing this for you. If you use only single-partition topics for testing,
>> you might not even notice your app is
>> spitting out incorrect results!
>>
>> Anyways, I feel pretty strongly that it would be weird to introduce a new
>> feature and say that to use it, you can't take
>> advantage of this other feature anymore. Also, is it possible our
>> optimization framework could ever include an
>> optimized repartitioning strategy that is better than what a user could
>> achieve by manually inserting repartitions?
>> Do we expect users to have a deep understanding of the best way to
>> repartition their particular topology, or is it
>> likely they will end up over-repartitioning either due to missed
>> optimizations or unnecessary extra repartitions?
>> I think many users would prefer to just say "if there *is* a repartition
>> required at this point in the topology, it should
>> have N partitions"
>>
>> As to the idea of adding `numberOfPartitions` to Grouped rather than
>> adding a new parameter to groupBy, that does seem more in line with the
>> current syntax so +1 from me
>>
>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> While https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>>> almost done, I want to resurrect discussion about this KIP to address
>>> couple of concerns raised by Matthias and John.
>>>
>>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>>> repartitioning and parallelism of sub-topologies by:
>>> 1) Introducing new KStream#repartition operation which is done in
>>> https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170>
>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>>> be separate PR.
>>>
>>> While all agree about general implementation and idea behind
>>> https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>>> KStream#groupBy(Repartitioned) method overload raised some questions during
>>> the review.
>>> Matthias raised concern that there can be cases when user uses
>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>>> not required, thus configuration passed via `Repartitioned` would never be
>>> applied (Matthias, please correct me if I misinterpreted your comment).
>>> So instead, if user wants to control parallelism of sub-topologies, he or
>>> she should always use `KStream#repartition` operation before groupBy. Full
>>> comment can be seen here:
>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>>>
>>> On the same topic, John pointed out that, from API design perspective, we
>>> shouldn’t intertwine configuration classes of different operators between
>>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>>> for specifying number of partitions for internal topic, we should update
>>> existing `Grouped` class with `numberOfPartitions` field.
>>>
>>> Personally, I think Matthias’s concern is valid, but on the other hand
>>> Kafka Streams has already optimizer in place which alters topology
>>> independently from user. So maybe it makes sense if Kafka Streams,
>>> internally would optimize topology in the best way possible, even if in
>>> some cases this means ignoring some operator configurations passed by the
>>> user. Also, I agree with John about API design semantics. If we go through
>>> with the changes for `KStream#groupBy` operation, it makes more sense to
>>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>>> new `KStream#groupBy(Repartitioned)` method overload.
>>>
>>> I would really appreciate communities feedback on this.
>>>
>>> Kind regards,
>>> Levani
>>>
>>>
>>>
>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
>>> wrote:
>>>>
>>>> Hey Levani,
>>>>
>>>> I think people are busy with the upcoming 2.4 release, and don't have
>>> much
>>>> spare time at the
>>>> moment. It's kind of a difficult time to get attention on things, but
>>> feel
>>>> free to pick up something else
>>>> to work on in the meantime until things have calmed down a bit!
>>>>
>>>> Cheers,
>>>> Sophie
>>>>
>>>>
>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> Sorry for bringing this thread again, but I would like to get some
>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170> <
>>>>> https://github.com/apache/kafka/pull/7170 <
>>> https://github.com/apache/kafka/pull/7170>>
>>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>>> Please let me know if you have any concerns.
>>>>>
>>>>> Regards,
>>>>> Levani
>>>>>
>>>>>
>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>>> levani.codes@gmail.com <ma...@gmail.com>>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Here’s voting thread for this KIP:
>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>>>>>
>>>>>> Regards,
>>>>>> Levani
>>>>>>
>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>
>>>>>>> Hi Matthias,
>>>>>>>
>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>>>
>>>>>>> Updated KIP
>>>>>>>
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>>
>>>>>>>
>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
>>> <ma...@confluent.io>
>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>
>>>>>>>> One question:
>>>>>>>>
>>>>>>>> Why do we add
>>>>>>>>
>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>>>
>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>>> be
>>>>>>>> sufficient. Users can chain the method calls.
>>>>>>>>
>>>>>>>> (I think it's valuable to keep the number of overload small if
>>>>> possible.)
>>>>>>>>
>>>>>>>> Otherwise LGTM.
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Thanks all for your feedback.
>>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>>> concerns about this KIP, please let me know.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>>
>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Matthias,
>>>>>>>>>>
>>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>>> I’ve updated KIP (
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>>>> ).
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Levani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>>> matthias@confluent.io <ma...@confluent.io>
>>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
>>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>>>
>>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>>> strategy.
>>>>>>>>>>>
>>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>>> configs. My
>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>>> `Produced`,
>>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>>> helps
>>>>>>>>>>> to keep the API "clean".
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> With regard to method names. I would prefer to avoid
>>> abbreviations.
>>>>> Can
>>>>>>>>>>> we rename:
>>>>>>>>>>>
>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>>>
>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>>> have
>>>>> some duplicate configurations between operators in order to keep API
>>> simple
>>>>> and usable.
>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>>> backward compatibility.
>>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>>> exciting, having Repartitioned class encapsulate some producer level
>>>>> configs makes API more readable.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Levani
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>>> separate.
>>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>>> configuration. But
>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>>> configured
>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>>> writing to the
>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>>> sense, these
>>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>>> true to
>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>>> wrapped into a
>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>>> split out the
>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>>> topic and
>>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>>> API surface
>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>>> is
>>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>>> about what
>>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>>> Repartitioned,
>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>>> Streams at
>>>>>>>>>>>>> the moment
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>>> configurations (such
>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>>> couldn’t really
>>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>>> class and just
>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>>> semantics.
>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>>> and
>>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>>> thing to do.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
>>> <ma...@gmail.com>
>>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
>>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>>> the reason
>>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>>> partitioning
>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>>> key
>>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>>> the key
>>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>>> in
>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>>> consistency.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>>> with the
>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>>> topics, user
>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>>> operation.
>>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>>> john@confluent.io <ma...@confluent.io>
>>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
>>> john@confluent.io <ma...@confluent.io> <mailto:
>>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>>> sure it
>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>>> rationale
>>>>> for
>>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>>> first issue
>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>>> essentially
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
>>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>>> john@confluent.io>
>>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>>> the
>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>>> get
>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>>> right-side
>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>>> same
>>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>>> important?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>>> not
>>>>> sure
>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>>> need control
>>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>>> how to
>>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>>> topics
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>>> specify
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>>> specifying the
>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>>> this.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>>> would be a
>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>>> Produced? I'm not
>>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>>> be
>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>>> the topic
>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>>> is
>>>>> missing
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>>> configuring
>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>>> --
>>>>> for the
>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>>> will the
>>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>>> for
>>>>> the
>>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>>> pre-created? The
>>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>>> will
>>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>
>>>>> <
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>>> about
>>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>>> confluence
>>>>>>>>>>>>>> page
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>>> class for
>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>>> as
>>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>>> summarizing
>>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>>> already (can't
>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>>> to
>>>>> the
>>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>>> operators
>>>>> that use
>>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>>> indicate
>>>>> that
>>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>>> use-defined
>>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>>> operator that
>>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>>> one
>>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>>> sometimes
>>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>>> wondering
>>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>>> it is
>>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>>> others on
>>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>>> to
>>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>>> assume that
>>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>>> instead of
>>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>>> configs,
>>>>> like
>>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>>> need to
>>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>>> them
>>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>>> enhancing
>>>>> Produced
>>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>>> Let me
>>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>>> configuration,
>>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>>> the final
>>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>>> assumption
>>>>> that user
>>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>>> case, having
>>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>>> contract, seems
>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>>> explicitly for
>>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>>> are producer
>>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>>> configurations
>>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>>> Produced
>>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>>> topics
>>>>> created
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>>> should leave
>>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>>> should
>>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>>> create and
>>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>>> should be
>>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>>> configuration in
>>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>>> configuration in
>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>>> KStream#through - I
>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>>> possibility to
>>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>>> Considering
>>>>> arguments
>>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>>> options
>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>> num of
>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>> will
>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>>> new method
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>>> in one of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>>> following
>>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>> Grouped
>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>>> num of
>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>>> will
>>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>>> specified num of
>>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>>> Grouped
>>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>>> creating and
>>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>>> discussed in the
>>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>>> about this
>>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>>> levani.codes@gmail.com>>
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>>> KIP-221. Going
>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>>> around
>>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>>> understood, the
>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>>> method
>>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>>> control
>>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>
>>>
> 
> 


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Levani Kokhreidze <le...@gmail.com>.
Hi Sophie,

Thank you for your reply, very insightful. Looking forward hearing others opinion as well on this.

Kind regards,
Levani


> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
> 
>> Personally, I think Matthias’s concern is valid, but on the other hand
> Kafka Streams has already
>> optimizer in place which alters topology independently from user
> 
> I agree (with you) and think this is a good way to put it -- we currently
> auto-repartition for the user so
> that they don't have to walk through their entire topology and reason about
> when and where to place a
> `.through` (or the new `.repartition`), so why suddenly force this onto the
> user? How certain are we that
> users will always get this right? It's easy to imagine that during
> development, you write your new app with
> correctly placed repartitions in order to use this new feature. During the
> course of development you end up
> tweaking the topology, but don't remember to review or move the
> repartitioning since you're used to Streams
> doing this for you. If you use only single-partition topics for testing,
> you might not even notice your app is
> spitting out incorrect results!
> 
> Anyways, I feel pretty strongly that it would be weird to introduce a new
> feature and say that to use it, you can't take
> advantage of this other feature anymore. Also, is it possible our
> optimization framework could ever include an
> optimized repartitioning strategy that is better than what a user could
> achieve by manually inserting repartitions?
> Do we expect users to have a deep understanding of the best way to
> repartition their particular topology, or is it
> likely they will end up over-repartitioning either due to missed
> optimizations or unnecessary extra repartitions?
> I think many users would prefer to just say "if there *is* a repartition
> required at this point in the topology, it should
> have N partitions"
> 
> As to the idea of adding `numberOfPartitions` to Grouped rather than
> adding a new parameter to groupBy, that does seem more in line with the
> current syntax so +1 from me
> 
> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
> wrote:
> 
>> Hello all,
>> 
>> While https://github.com/apache/kafka/pull/7170 <
>> https://github.com/apache/kafka/pull/7170> is under review and it’s
>> almost done, I want to resurrect discussion about this KIP to address
>> couple of concerns raised by Matthias and John.
>> 
>> As a reminder, idea of the KIP-221 was to allow DSL users control over
>> repartitioning and parallelism of sub-topologies by:
>> 1) Introducing new KStream#repartition operation which is done in
>> https://github.com/apache/kafka/pull/7170 <
>> https://github.com/apache/kafka/pull/7170>
>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
>> be separate PR.
>> 
>> While all agree about general implementation and idea behind
>> https://github.com/apache/kafka/pull/7170 <
>> https://github.com/apache/kafka/pull/7170> PR, introducing new
>> KStream#groupBy(Repartitioned) method overload raised some questions during
>> the review.
>> Matthias raised concern that there can be cases when user uses
>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
>> not required, thus configuration passed via `Repartitioned` would never be
>> applied (Matthias, please correct me if I misinterpreted your comment).
>> So instead, if user wants to control parallelism of sub-topologies, he or
>> she should always use `KStream#repartition` operation before groupBy. Full
>> comment can be seen here:
>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>> 
>> On the same topic, John pointed out that, from API design perspective, we
>> shouldn’t intertwine configuration classes of different operators between
>> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
>> for specifying number of partitions for internal topic, we should update
>> existing `Grouped` class with `numberOfPartitions` field.
>> 
>> Personally, I think Matthias’s concern is valid, but on the other hand
>> Kafka Streams has already optimizer in place which alters topology
>> independently from user. So maybe it makes sense if Kafka Streams,
>> internally would optimize topology in the best way possible, even if in
>> some cases this means ignoring some operator configurations passed by the
>> user. Also, I agree with John about API design semantics. If we go through
>> with the changes for `KStream#groupBy` operation, it makes more sense to
>> add `numberOfPartitions` field to `Grouped` class instead of introducing
>> new `KStream#groupBy(Repartitioned)` method overload.
>> 
>> I would really appreciate communities feedback on this.
>> 
>> Kind regards,
>> Levani
>> 
>> 
>> 
>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
>> wrote:
>>> 
>>> Hey Levani,
>>> 
>>> I think people are busy with the upcoming 2.4 release, and don't have
>> much
>>> spare time at the
>>> moment. It's kind of a difficult time to get attention on things, but
>> feel
>>> free to pick up something else
>>> to work on in the meantime until things have calmed down a bit!
>>> 
>>> Cheers,
>>> Sophie
>>> 
>>> 
>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
>> levani.codes@gmail.com <ma...@gmail.com>>
>>> wrote:
>>> 
>>>> Hello all,
>>>> 
>>>> Sorry for bringing this thread again, but I would like to get some
>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 <
>> https://github.com/apache/kafka/pull/7170> <
>>>> https://github.com/apache/kafka/pull/7170 <
>> https://github.com/apache/kafka/pull/7170>>
>>>> It's been a while now and I would love to move on to other KIPs as well.
>>>> Please let me know if you have any concerns.
>>>> 
>>>> Regards,
>>>> Levani
>>>> 
>>>> 
>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
>> levani.codes@gmail.com <ma...@gmail.com>>
>>>> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Here’s voting thread for this KIP:
>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
>> levani.codes@gmail.com <ma...@gmail.com>
>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>> 
>>>>>> Hi Matthias,
>>>>>> 
>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>>>> 
>>>>>> Updated KIP
>>>>>> 
>>>>>> Regards,
>>>>>> Levani
>>>>>> 
>>>>>> 
>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
>> <ma...@confluent.io>
>>>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>> 
>>>>>>> One question:
>>>>>>> 
>>>>>>> Why do we add
>>>>>>> 
>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>>>> 
>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should
>> be
>>>>>>> sufficient. Users can chain the method calls.
>>>>>>> 
>>>>>>> (I think it's valuable to keep the number of overload small if
>>>> possible.)
>>>>>>> 
>>>>>>> Otherwise LGTM.
>>>>>>> 
>>>>>>> 
>>>>>>> -Matthias
>>>>>>> 
>>>>>>> 
>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> Thanks all for your feedback.
>>>>>>>> I started voting procedure for this KIP. If there’re any other
>>>> concerns about this KIP, please let me know.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Levani
>>>>>>>> 
>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Matthias,
>>>>>>>>> 
>>>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>>>> I’ve updated KIP (
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>>>> ).
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
>> matthias@confluent.io <ma...@confluent.io>
>>>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
>> matthias@confluent.io <ma...@confluent.io> <mailto:
>>>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks for driving the KIP.
>>>>>>>>>> 
>>>>>>>>>> I agree that users need to be able to specify a partitioning
>>>> strategy.
>>>>>>>>>> 
>>>>>>>>>> Sophie raises a fair point about topic configs and producer
>>>> configs. My
>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>>>> `Produced`,
>>>>>>>>>> that adds topic configuration, is a good way to think about it and
>>>> helps
>>>>>>>>>> to keep the API "clean".
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> With regard to method names. I would prefer to avoid
>> abbreviations.
>>>> Can
>>>>>>>>>> we rename:
>>>>>>>>>> 
>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>>>> 
>>>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>>>> 
>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -Matthias
>>>>>>>>>> 
>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to
>> have
>>>> some duplicate configurations between operators in order to keep API
>> simple
>>>> and usable.
>>>>>>>>>>> Also, as more surface API has, harder it is to have proper
>>>> backward compatibility.
>>>>>>>>>>> While initial idea of keeping topic level configs separate was
>>>> exciting, having Repartitioned class encapsulate some producer level
>>>> configs makes API more readable.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>> sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>>>> configurations and (repartition) topic level considerations
>>>> separate.
>>>>>>>>>>>> Number of partitions is definitely purely a topic level
>>>> configuration. But
>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>>>> configuration as a producer one. You could have two producers
>>>> configured
>>>>>>>>>>>> with different serdes and/or partitioners, but if they are
>>>> writing to the
>>>>>>>>>>>> same topic the result would be very difficult to part. So in a
>>>> sense, these
>>>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>>>> 
>>>>>>>>>>>> Another way to think of it: while the Streams API is not always
>>>> true to
>>>>>>>>>>>> this, ideally all the relevant configs for an operator are
>>>> wrapped into a
>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>>>> split out the
>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>>>> topic and
>>>>>>>>>>>> producer level configurations separate, but this increases the
>>>> API surface
>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this
>> is
>>>>>>>>>>>> everything that this particular operator needs" without worrying
>>>> about what
>>>>>>>>>>>> exactly you're specifying.
>>>>>>>>>>>> 
>>>>>>>>>>>> I suppose you could alternatively make Produced a field of
>>>> Repartitioned,
>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>>>> Streams at
>>>>>>>>>>>> the moment
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>>>> `Repartitioned#partitioner`
>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>>>> configuration and keep topic level and producer level
>>>> configurations (such
>>>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>>>> couldn’t really
>>>>>>>>>>>>> figure out good operation name for Topic level configuration
>>>> class and just
>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>>>> semantics.
>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>>>> and
>>>>>>>>>>>>> producer level configurations for internal topics is viable
>>>> thing to do.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
>> <ma...@gmail.com>
>>>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
>> bbejeck@gmail.com <ma...@gmail.com> <mailto:
>>>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>>>> the reason
>>>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>>>> partitioning
>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>>>> key
>>>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>>>> the key
>>>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
>> in
>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>>>> consistency.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>>>> with the
>>>>>>>>>>>>>>> combination of specifying number of partitions for internal
>>>> topics, user
>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>>>> operation.
>>>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
>> john@confluent.io <ma...@confluent.io>
>>>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
>> john@confluent.io <ma...@confluent.io> <mailto:
>>>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>>>> sure it
>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the
>> rationale
>>>> for
>>>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>>>> first issue
>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>>>> essentially
>>>>>>>>>>>>> would
>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>>>> KStream#join
>>>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>>>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
>> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
>> john@confluent.io>
>>>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
>> the
>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>>>> get
>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>>>> right-side
>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>>>> same
>>>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>>>> important?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
>> not
>>>> sure
>>>>>>>>>>>>> about
>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>>>> need control
>>>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>>>> how to
>>>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>>>> topics
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>>>> specify
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>>>> specifying the
>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
>> this.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>>>> would be a
>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>>>> Produced? I'm not
>>>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
>> be
>>>>>>>>>>>>> available
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>>>> the topic
>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
>> is
>>>> missing
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>>>> configuring
>>>>>>>>>>>>> num
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
>> sophie@confluent.io <ma...@confluent.io>> <mailto:
>>>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
>> --
>>>> for the
>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>>>> will the
>>>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
>> for
>>>> the
>>>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>>>> pre-created? The
>>>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>>>> will
>>>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>>>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>> <
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>> 
>>>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
>> about
>>>> this KIP.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>>>> confluence
>>>>>>>>>>>>> page
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>>>> class for
>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
>> `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
>> as
>>>> well.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>>>> summarizing
>>>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>>>> already (can't
>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>>>> refresh the
>>>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
>> to
>>>> the
>>>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
>> operators
>>>> that use
>>>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
>> indicate
>>>> that
>>>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>>>> use-defined
>>>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>>>> operator that
>>>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>>>> one
>>>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>>>> sometimes
>>>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>>>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>>>> wondering
>>>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>>>> it is
>>>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>>>> others on
>>>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
>> to
>>>>>>>>>>>>> configure
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>>>> assume that
>>>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>>>> instead of
>>>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
>> configs,
>>>> like
>>>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>>>> need to
>>>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>>>> them
>>>>>>>>>>>>> myself.
>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
>> enhancing
>>>> Produced
>>>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>>>> Let me
>>>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>>>> configuration,
>>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>>>> the final
>>>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
>> assumption
>>>> that user
>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>>>> case, having
>>>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>>>> contract, seems
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>>>> explicitly for
>>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>>>> are producer
>>>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>>>> configurations
>>>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>>>> Produced
>>>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
>> topics
>>>> created
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>>>> should leave
>>>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>>>> should
>>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>>>> create and
>>>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>>>> should be
>>>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>>>> configuration in
>>>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>>>> configuration in
>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>>>> KStream#through - I
>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>>>> possibility to
>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
>> Considering
>>>> arguments
>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>>>> options
>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>> num of
>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>> will
>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>> specified num of
>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>>>> new method
>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>>>> in one of
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>>>> following
>>>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>> Grouped
>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>>>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>>>> num of
>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
>> will
>>>> be auto
>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>>>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>>>> specified num of
>>>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
>> Grouped
>>>> class (as
>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>>>> creating and
>>>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>>>> discussed in the
>>>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>>>> about this
>>>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
>> levani.codes@gmail.com>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>>>> KIP-221. Going
>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>>>> around
>>>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>>>> understood, the
>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>>>> method
>>>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
>> control
>>>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>> 
>> 



Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Sophie Blee-Goldman <so...@confluent.io>.
> Personally, I think Matthias’s concern is valid, but on the other hand
Kafka Streams has already
> optimizer in place which alters topology independently from user

I agree (with you) and think this is a good way to put it -- we currently
auto-repartition for the user so
that they don't have to walk through their entire topology and reason about
when and where to place a
`.through` (or the new `.repartition`), so why suddenly force this onto the
user? How certain are we that
users will always get this right? It's easy to imagine that during
development, you write your new app with
correctly placed repartitions in order to use this new feature. During the
course of development you end up
tweaking the topology, but don't remember to review or move the
repartitioning since you're used to Streams
doing this for you. If you use only single-partition topics for testing,
you might not even notice your app is
spitting out incorrect results!

Anyways, I feel pretty strongly that it would be weird to introduce a new
feature and say that to use it, you can't take
advantage of this other feature anymore. Also, is it possible our
optimization framework could ever include an
optimized repartitioning strategy that is better than what a user could
achieve by manually inserting repartitions?
Do we expect users to have a deep understanding of the best way to
repartition their particular topology, or is it
likely they will end up over-repartitioning either due to missed
optimizations or unnecessary extra repartitions?
I think many users would prefer to just say "if there *is* a repartition
required at this point in the topology, it should
have N partitions"

As to the idea of adding `numberOfPartitions` to Grouped rather than
adding a new parameter to groupBy, that does seem more in line with the
current syntax so +1 from me

On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <le...@gmail.com>
wrote:

> Hello all,
>
> While https://github.com/apache/kafka/pull/7170 <
> https://github.com/apache/kafka/pull/7170> is under review and it’s
> almost done, I want to resurrect discussion about this KIP to address
> couple of concerns raised by Matthias and John.
>
> As a reminder, idea of the KIP-221 was to allow DSL users control over
> repartitioning and parallelism of sub-topologies by:
> 1) Introducing new KStream#repartition operation which is done in
> https://github.com/apache/kafka/pull/7170 <
> https://github.com/apache/kafka/pull/7170>
> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to
> be separate PR.
>
> While all agree about general implementation and idea behind
> https://github.com/apache/kafka/pull/7170 <
> https://github.com/apache/kafka/pull/7170> PR, introducing new
> KStream#groupBy(Repartitioned) method overload raised some questions during
> the review.
> Matthias raised concern that there can be cases when user uses
> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may
> not required, thus configuration passed via `Repartitioned` would never be
> applied (Matthias, please correct me if I misinterpreted your comment).
> So instead, if user wants to control parallelism of sub-topologies, he or
> she should always use `KStream#repartition` operation before groupBy. Full
> comment can be seen here:
> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <
> https://github.com/apache/kafka/pull/7170#issuecomment-519303125>
>
> On the same topic, John pointed out that, from API design perspective, we
> shouldn’t intertwine configuration classes of different operators between
> one another. So instead of introducing new `KStream#groupBy(Repartitioned)`
> for specifying number of partitions for internal topic, we should update
> existing `Grouped` class with `numberOfPartitions` field.
>
> Personally, I think Matthias’s concern is valid, but on the other hand
> Kafka Streams has already optimizer in place which alters topology
> independently from user. So maybe it makes sense if Kafka Streams,
> internally would optimize topology in the best way possible, even if in
> some cases this means ignoring some operator configurations passed by the
> user. Also, I agree with John about API design semantics. If we go through
> with the changes for `KStream#groupBy` operation, it makes more sense to
> add `numberOfPartitions` field to `Grouped` class instead of introducing
> new `KStream#groupBy(Repartitioned)` method overload.
>
> I would really appreciate communities feedback on this.
>
> Kind regards,
> Levani
>
>
>
> > On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io>
> wrote:
> >
> > Hey Levani,
> >
> > I think people are busy with the upcoming 2.4 release, and don't have
> much
> > spare time at the
> > moment. It's kind of a difficult time to get attention on things, but
> feel
> > free to pick up something else
> > to work on in the meantime until things have calmed down a bit!
> >
> > Cheers,
> > Sophie
> >
> >
> > On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <
> levani.codes@gmail.com <ma...@gmail.com>>
> > wrote:
> >
> >> Hello all,
> >>
> >> Sorry for bringing this thread again, but I would like to get some
> >> attention on this PR: https://github.com/apache/kafka/pull/7170 <
> https://github.com/apache/kafka/pull/7170> <
> >> https://github.com/apache/kafka/pull/7170 <
> https://github.com/apache/kafka/pull/7170>>
> >> It's been a while now and I would love to move on to other KIPs as well.
> >> Please let me know if you have any concerns.
> >>
> >> Regards,
> >> Levani
> >>
> >>
> >>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <
> levani.codes@gmail.com <ma...@gmail.com>>
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Here’s voting thread for this KIP:
> >> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
> >> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
> >>>
> >>> Regards,
> >>> Levani
> >>>
> >>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <
> levani.codes@gmail.com <ma...@gmail.com>
> >> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>
> >>>> Hi Matthias,
> >>>>
> >>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
> >>>> Agree that avoiding unnecessary method overloads is a good idea.
> >>>>
> >>>> Updated KIP
> >>>>
> >>>> Regards,
> >>>> Levani
> >>>>
> >>>>
> >>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
> <ma...@confluent.io>
> >> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>
> >>>>> One question:
> >>>>>
> >>>>> Why do we add
> >>>>>
> >>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
> >>>>>
> >>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
> >>>>> combination with `withName()` and `withNumberOfPartitions()` should
> be
> >>>>> sufficient. Users can chain the method calls.
> >>>>>
> >>>>> (I think it's valuable to keep the number of overload small if
> >> possible.)
> >>>>>
> >>>>> Otherwise LGTM.
> >>>>>
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>>
> >>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> Thanks all for your feedback.
> >>>>>> I started voting procedure for this KIP. If there’re any other
> >> concerns about this KIP, please let me know.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Levani
> >>>>>>
> >>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> >> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>
> >>>>>>> Hi Matthias,
> >>>>>>>
> >>>>>>> Thanks for the suggestion, makes sense.
> >>>>>>> I’ve updated KIP (
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >>>> ).
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Levani
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <
> matthias@confluent.io <ma...@confluent.io>
> >> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:
> matthias@confluent.io <ma...@confluent.io> <mailto:
> >> matthias@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>
> >>>>>>>> Thanks for driving the KIP.
> >>>>>>>>
> >>>>>>>> I agree that users need to be able to specify a partitioning
> >> strategy.
> >>>>>>>>
> >>>>>>>> Sophie raises a fair point about topic configs and producer
> >> configs. My
> >>>>>>>> take is, that consider `Repartitioned` as an "extension" to
> >> `Produced`,
> >>>>>>>> that adds topic configuration, is a good way to think about it and
> >> helps
> >>>>>>>> to keep the API "clean".
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> With regard to method names. I would prefer to avoid
> abbreviations.
> >> Can
> >>>>>>>> we rename:
> >>>>>>>>
> >>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>>>
> >>>>>>>> Furthermore, it might be good to add some more `static` methods:
> >>>>>>>>
> >>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>>>> Totally agree. I think in KStream interface it makes sense to
> have
> >> some duplicate configurations between operators in order to keep API
> simple
> >> and usable.
> >>>>>>>>> Also, as more surface API has, harder it is to have proper
> >> backward compatibility.
> >>>>>>>>> While initial idea of keeping topic level configs separate was
> >> exciting, having Repartitioned class encapsulate some producer level
> >> configs makes API more readable.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Levani
> >>>>>>>>>
> >>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> >> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I think that is a good point about trying to keep producer level
> >>>>>>>>>> configurations and (repartition) topic level considerations
> >> separate.
> >>>>>>>>>> Number of partitions is definitely purely a topic level
> >> configuration. But
> >>>>>>>>>> on some level, serdes and partitioners are just as much a topic
> >>>>>>>>>> configuration as a producer one. You could have two producers
> >> configured
> >>>>>>>>>> with different serdes and/or partitioners, but if they are
> >> writing to the
> >>>>>>>>>> same topic the result would be very difficult to part. So in a
> >> sense, these
> >>>>>>>>>> are configurations of topics in Streams, not just producers.
> >>>>>>>>>>
> >>>>>>>>>> Another way to think of it: while the Streams API is not always
> >> true to
> >>>>>>>>>> this, ideally all the relevant configs for an operator are
> >> wrapped into a
> >>>>>>>>>> single object (in this case, Repartitioned). We could instead
> >> split out the
> >>>>>>>>>> fields in common with Produced into a separate parameter to keep
> >> topic and
> >>>>>>>>>> producer level configurations separate, but this increases the
> >> API surface
> >>>>>>>>>> area by a lot. It's much more straightforward to just say "this
> is
> >>>>>>>>>> everything that this particular operator needs" without worrying
> >> about what
> >>>>>>>>>> exactly you're specifying.
> >>>>>>>>>>
> >>>>>>>>>> I suppose you could alternatively make Produced a field of
> >> Repartitioned,
> >>>>>>>>>> but I don't think we do this kind of composition elsewhere in
> >> Streams at
> >>>>>>>>>> the moment
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> >> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> >> `Repartitioned#partitioner`
> >>>>>>>>>>> configuration.
> >>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
> >>>>>>>>>>> configuration and keep topic level and producer level
> >> configurations (such
> >>>>>>>>>>> as Produced) separately (see my second email in this thread).
> >>>>>>>>>>> But while looking at the semantics of KStream interface, I
> >> couldn’t really
> >>>>>>>>>>> figure out good operation name for Topic level configuration
> >> class and just
> >>>>>>>>>>> introducing `Topic` config class was kinda breaking the
> >> semantics.
> >>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
> >> and
> >>>>>>>>>>> producer level configurations for internal topics is viable
> >> thing to do.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
> <ma...@gmail.com>
> >> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:
> bbejeck@gmail.com <ma...@gmail.com> <mailto:
> >> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Lavani,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
> >> the reason
> >>>>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Users may want to use something other than hash-based
> >> partitioning
> >>>>>>>>>>>> 2. Users may wish to partition on something different than the
> >> key
> >>>>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
> >> the key
> >>>>>>>>>>>> 2. Something other than the key
> >>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence
> in
> >>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
> >> consistency.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Bill
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
> >> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
> >> with the
> >>>>>>>>>>>>> combination of specifying number of partitions for internal
> >> topics, user
> >>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
> >> operation.
> >>>>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <
> john@confluent.io <ma...@confluent.io>
> >> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:
> john@confluent.io <ma...@confluent.io> <mailto:
> >> john@confluent.io <ma...@confluent.io>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
> >> sure it
> >>>>>>>>>>>>>> makes sense, but I believe something similar is the
> rationale
> >> for
> >>>>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
> >> first issue
> >>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
> >> essentially
> >>>>>>>>>>> would
> >>>>>>>>>>>>> be the same as the topic I want to join with and then do the
> >>>>>>>>>>> KStream#join
> >>>>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> >> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
> <ma...@confluent.io>> <mailto:john@confluent.io <mailto:
> john@confluent.io>
> >> <mailto:john@confluent.io <ma...@confluent.io>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think it might be useful to have an option to specify
> the
> >>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
> >> get
> >>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
> >> right-side
> >>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
> >>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
> >> same
> >>>>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
> >> important?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
> >>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
> >> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m
> not
> >> sure
> >>>>>>>>>>> about
> >>>>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
> >> need control
> >>>>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
> >> how to
> >>>>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
> >> topics
> >>>>>>>>>>> that
> >>>>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
> >> specify
> >>>>>>>>>>> the
> >>>>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
> >> specifying the
> >>>>>>>>>>>>> partitioner can be added as well if there’s agreement on
> this.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
> >> would be a
> >>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
> >>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
> >> Produced? I'm not
> >>>>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should
> be
> >>>>>>>>>>> available
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
> >>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
> >> the topic
> >>>>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that
> is
> >> missing
> >>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
> >> configuring
> >>>>>>>>>>> num
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
> >>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>> <mailto:
> >> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me
> --
> >> for the
> >>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
> >> will the
> >>>>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case
> for
> >> the
> >>>>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
> >> pre-created? The
> >>>>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
> >> will
> >>>>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> >> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >>>>
> >>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>
> >>>>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns
> about
> >> this KIP.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
> >>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
> >> confluence
> >>>>>>>>>>> page
> >>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
> >> class for
> >>>>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>>>> configuration for internal operators like
> `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this
> as
> >> well.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
> >>>>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
> >> summarizing
> >>>>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
> >> already (can't
> >>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
> >> refresh the
> >>>>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard
> to
> >> the
> >>>>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and
> operators
> >> that use
> >>>>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might
> indicate
> >> that
> >>>>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
> >> use-defined
> >>>>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
> >> operator that
> >>>>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
> >> one
> >>>>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
> >> sometimes
> >>>>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
> >> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
> >> wondering
> >>>>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
> >> it is
> >>>>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
> >> others on
> >>>>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow
> to
> >>>>>>>>>>> configure
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
> >> assume that
> >>>>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
> >>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
> >> instead of
> >>>>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other
> configs,
> >> like
> >>>>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
> >> need to
> >>>>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
> >> them
> >>>>>>>>>>> myself.
> >>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe
> enhancing
> >> Produced
> >>>>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
> >> Let me
> >>>>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
> >> configuration,
> >>>>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
> >> the final
> >>>>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable
> assumption
> >> that user
> >>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
> >> case, having
> >>>>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
> >> contract, seems
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
> >> explicitly for
> >>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
> >> are producer
> >>>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
> >>>>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> >> configurations
> >>>>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
> >> Produced
> >>>>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g
> topics
> >> created
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
> >> should leave
> >>>>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
> >> should
> >>>>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
> >> create and
> >>>>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
> >> should be
> >>>>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
> >> configuration in
> >>>>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
> >> configuration in
> >>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
> >> KStream#through - I
> >>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
> >> possibility to
> >>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :)
> Considering
> >> arguments
> >>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
> >> options
> >>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >> num of
> >>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> will
> >> be auto
> >>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >> specified num of
> >>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
> >> new method
> >>>>>>>>>>> -
> >>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
> >> in one of
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
> >> following
> >>>>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> Grouped
> >> class (as
> >>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> >> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> >> num of
> >>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic
> will
> >> be auto
> >>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> >> Produced<K, V>
> >>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> >> specified num of
> >>>>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to
> Grouped
> >> class (as
> >>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
> >> creating and
> >>>>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
> >> discussed in the
> >>>>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
> >> about this
> >>>>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <mailto:
> levani.codes@gmail.com>>
> >> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
> >> KIP-221. Going
> >>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
> >> around
> >>>>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
> >> understood, the
> >>>>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
> >> method
> >>>>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
> >>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to
> control
> >>>>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful
> operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
> >>>>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>
>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Levani Kokhreidze <le...@gmail.com>.
Hello all,

While https://github.com/apache/kafka/pull/7170 <https://github.com/apache/kafka/pull/7170> is under review and it’s almost done, I want to resurrect discussion about this KIP to address couple of concerns raised by Matthias and John.

As a reminder, idea of the KIP-221 was to allow DSL users control over repartitioning and parallelism of sub-topologies by:
1) Introducing new KStream#repartition operation which is done in https://github.com/apache/kafka/pull/7170 <https://github.com/apache/kafka/pull/7170> 
2) Add new KStream#groupBy(Repartitioned) operation, which is planned to be separate PR.

While all agree about general implementation and idea behind https://github.com/apache/kafka/pull/7170 <https://github.com/apache/kafka/pull/7170> PR, introducing new KStream#groupBy(Repartitioned) method overload raised some questions during the review.
Matthias raised concern that there can be cases when user uses `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may not required, thus configuration passed via `Repartitioned` would never be applied (Matthias, please correct me if I misinterpreted your comment). 
So instead, if user wants to control parallelism of sub-topologies, he or she should always use `KStream#repartition` operation before groupBy. Full comment can be seen here: https://github.com/apache/kafka/pull/7170#issuecomment-519303125 <https://github.com/apache/kafka/pull/7170#issuecomment-519303125> 

On the same topic, John pointed out that, from API design perspective, we shouldn’t intertwine configuration classes of different operators between one another. So instead of introducing new `KStream#groupBy(Repartitioned)` for specifying number of partitions for internal topic, we should update existing `Grouped` class with `numberOfPartitions` field.

Personally, I think Matthias’s concern is valid, but on the other hand Kafka Streams has already optimizer in place which alters topology independently from user. So maybe it makes sense if Kafka Streams, internally would optimize topology in the best way possible, even if in some cases this means ignoring some operator configurations passed by the user. Also, I agree with John about API design semantics. If we go through with the changes for `KStream#groupBy` operation, it makes more sense to add `numberOfPartitions` field to `Grouped` class instead of introducing new `KStream#groupBy(Repartitioned)` method overload. 

I would really appreciate communities feedback on this.

Kind regards,
Levani



> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <so...@confluent.io> wrote:
> 
> Hey Levani,
> 
> I think people are busy with the upcoming 2.4 release, and don't have much
> spare time at the
> moment. It's kind of a difficult time to get attention on things, but feel
> free to pick up something else
> to work on in the meantime until things have calmed down a bit!
> 
> Cheers,
> Sophie
> 
> 
> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <levani.codes@gmail.com <ma...@gmail.com>>
> wrote:
> 
>> Hello all,
>> 
>> Sorry for bringing this thread again, but I would like to get some
>> attention on this PR: https://github.com/apache/kafka/pull/7170 <https://github.com/apache/kafka/pull/7170> <
>> https://github.com/apache/kafka/pull/7170 <https://github.com/apache/kafka/pull/7170>>
>> It's been a while now and I would love to move on to other KIPs as well.
>> Please let me know if you have any concerns.
>> 
>> Regards,
>> Levani
>> 
>> 
>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <levani.codes@gmail.com <ma...@gmail.com>>
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> Here’s voting thread for this KIP:
>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> <
>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>>
>>> 
>>> Regards,
>>> Levani
>>> 
>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>> 
>>>> Hi Matthias,
>>>> 
>>>> Thanks for the suggestion. I Don’t have strong opinion on that one.
>>>> Agree that avoiding unnecessary method overloads is a good idea.
>>>> 
>>>> Updated KIP
>>>> 
>>>> Regards,
>>>> Levani
>>>> 
>>>> 
>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io <ma...@confluent.io>
>> <mailto:matthias@confluent.io <ma...@confluent.io>>> wrote:
>>>>> 
>>>>> One question:
>>>>> 
>>>>> Why do we add
>>>>> 
>>>>>> Repartitioned#with(final String name, final int numberOfPartitions)
>>>>> 
>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
>>>>> combination with `withName()` and `withNumberOfPartitions()` should be
>>>>> sufficient. Users can chain the method calls.
>>>>> 
>>>>> (I think it's valuable to keep the number of overload small if
>> possible.)
>>>>> 
>>>>> Otherwise LGTM.
>>>>> 
>>>>> 
>>>>> -Matthias
>>>>> 
>>>>> 
>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> Thanks all for your feedback.
>>>>>> I started voting procedure for this KIP. If there’re any other
>> concerns about this KIP, please let me know.
>>>>>> 
>>>>>> Regards,
>>>>>> Levani
>>>>>> 
>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
>>>>>>> 
>>>>>>> Hi Matthias,
>>>>>>> 
>>>>>>> Thanks for the suggestion, makes sense.
>>>>>>> I’ve updated KIP (
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>> ).
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <matthias@confluent.io <ma...@confluent.io>
>> <mailto:matthias@confluent.io <ma...@confluent.io>> <mailto:matthias@confluent.io <ma...@confluent.io> <mailto:
>> matthias@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>> 
>>>>>>>> Thanks for driving the KIP.
>>>>>>>> 
>>>>>>>> I agree that users need to be able to specify a partitioning
>> strategy.
>>>>>>>> 
>>>>>>>> Sophie raises a fair point about topic configs and producer
>> configs. My
>>>>>>>> take is, that consider `Repartitioned` as an "extension" to
>> `Produced`,
>>>>>>>> that adds topic configuration, is a good way to think about it and
>> helps
>>>>>>>> to keep the API "clean".
>>>>>>>> 
>>>>>>>> 
>>>>>>>> With regard to method names. I would prefer to avoid abbreviations.
>> Can
>>>>>>>> we rename:
>>>>>>>> 
>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>>>>>> 
>>>>>>>> Furthermore, it might be good to add some more `static` methods:
>>>>>>>> 
>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>>>>>> - Repartitioned.withNumberOfPartitions(int)
>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Matthias
>>>>>>>> 
>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>>>>>> Totally agree. I think in KStream interface it makes sense to have
>> some duplicate configurations between operators in order to keep API simple
>> and usable.
>>>>>>>>> Also, as more surface API has, harder it is to have proper
>> backward compatibility.
>>>>>>>>> While initial idea of keeping topic level configs separate was
>> exciting, having Repartitioned class encapsulate some producer level
>> configs makes API more readable.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
>> sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>> <mailto:
>> sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> I think that is a good point about trying to keep producer level
>>>>>>>>>> configurations and (repartition) topic level considerations
>> separate.
>>>>>>>>>> Number of partitions is definitely purely a topic level
>> configuration. But
>>>>>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>>>>>> configuration as a producer one. You could have two producers
>> configured
>>>>>>>>>> with different serdes and/or partitioners, but if they are
>> writing to the
>>>>>>>>>> same topic the result would be very difficult to part. So in a
>> sense, these
>>>>>>>>>> are configurations of topics in Streams, not just producers.
>>>>>>>>>> 
>>>>>>>>>> Another way to think of it: while the Streams API is not always
>> true to
>>>>>>>>>> this, ideally all the relevant configs for an operator are
>> wrapped into a
>>>>>>>>>> single object (in this case, Repartitioned). We could instead
>> split out the
>>>>>>>>>> fields in common with Produced into a separate parameter to keep
>> topic and
>>>>>>>>>> producer level configurations separate, but this increases the
>> API surface
>>>>>>>>>> area by a lot. It's much more straightforward to just say "this is
>>>>>>>>>> everything that this particular operator needs" without worrying
>> about what
>>>>>>>>>> exactly you're specifying.
>>>>>>>>>> 
>>>>>>>>>> I suppose you could alternatively make Produced a field of
>> Repartitioned,
>>>>>>>>>> but I don't think we do this kind of composition elsewhere in
>> Streams at
>>>>>>>>>> the moment
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Bill,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks a lot for the feedback.
>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with
>> `Repartitioned#partitioner`
>>>>>>>>>>> configuration.
>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>>>>>> configuration and keep topic level and producer level
>> configurations (such
>>>>>>>>>>> as Produced) separately (see my second email in this thread).
>>>>>>>>>>> But while looking at the semantics of KStream interface, I
>> couldn’t really
>>>>>>>>>>> figure out good operation name for Topic level configuration
>> class and just
>>>>>>>>>>> introducing `Topic` config class was kinda breaking the
>> semantics.
>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic
>> and
>>>>>>>>>>> producer level configurations for internal topics is viable
>> thing to do.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com <ma...@gmail.com>
>> <mailto:bbejeck@gmail.com <ma...@gmail.com>> <mailto:bbejeck@gmail.com <ma...@gmail.com> <mailto:
>> bbejeck@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Lavani,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
>> the reason
>>>>>>>>>>>> provided by John, my reasoning is:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Users may want to use something other than hash-based
>> partitioning
>>>>>>>>>>>> 2. Users may wish to partition on something different than the
>> key
>>>>>>>>>>>> without having to change the key.  For example:
>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with
>> the key
>>>>>>>>>>>> 2. Something other than the key
>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence in
>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
>> consistency.
>>>>>>>>>>>> 
>>>>>>>>>>>> Just my  2 cents.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Bill
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>> <mailto:
>> levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In my mind it makes sense.
>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
>> with the
>>>>>>>>>>>>> combination of specifying number of partitions for internal
>> topics, user
>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join
>> operation.
>>>>>>>>>>>>> I think this can be quite powerful feature.
>>>>>>>>>>>>> Wondering what others think about this?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <john@confluent.io <ma...@confluent.io>
>> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:john@confluent.io <ma...@confluent.io> <mailto:
>> john@confluent.io <ma...@confluent.io>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
>> sure it
>>>>>>>>>>>>>> makes sense, but I believe something similar is the rationale
>> for
>>>>>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hey John,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Oh that’s interesting use-case.
>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would
>> first issue
>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
>> essentially
>>>>>>>>>>> would
>>>>>>>>>>>>> be the same as the topic I want to join with and then do the
>>>>>>>>>>> KStream#join
>>>>>>>>>>>>> with DSL?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
>> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io <ma...@confluent.io>> <mailto:john@confluent.io <ma...@confluent.io>
>> <mailto:john@confluent.io <ma...@confluent.io>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
>> get
>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
>> right-side
>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
>> same
>>>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
>> important?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not
>> sure
>>>>>>>>>>> about
>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
>> need control
>>>>>>>>>>>>> over partitioner for internal topics.
>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
>> how to
>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
>> topics
>>>>>>>>>>> that
>>>>>>>>>>>>> are created by user In advance.
>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
>> specify
>>>>>>>>>>> the
>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
>> specifying the
>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>> <mailto:
>> sophie@confluent.io <ma...@confluent.io>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
>> would be a
>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
>> Produced? I'm not
>>>>>>>>>>>>> sure how
>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should be
>>>>>>>>>>> available
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and
>>>>>>>>>>>>> KStream#repartition(Repartitioned)
>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
>> the topic
>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>> num of partitions.
>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that is
>> missing
>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> current DSL design.
>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
>> configuring
>>>>>>>>>>> num
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> partitions for internal topics.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:sophie@confluent.io <ma...@confluent.io>> <mailto:
>> sophie@confluent.io <ma...@confluent.io>>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me --
>> for the
>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
>> will the
>>>>>>>>>>>>> topic be
>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case for
>> the
>>>>>>>>>>>>> signature
>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
>> pre-created? The
>>>>>>>>>>>>> wording
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
>> will
>>>>>>>>>>>>> auto-create
>>>>>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
>> <mailto:levani.codes@gmail.com <ma...@gmail.com> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>>>> 
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint <https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>> 
>>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns about
>> this KIP.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
>> confluence
>>>>>>>>>>> page
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>> the new proposal
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> King regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com
>>>>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
>> class for
>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like `groupedBy`.
>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this as
>> well.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>>>>>> matthias@confluent.io
>>>>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
>> summarizing
>>>>>>>>>>>>>>>>>>> everything.
>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
>> already (can't
>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
>> refresh the
>>>>>>>>>>>>>>>>>>>>> discussion.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to
>> the
>>>>>>>>>>>>> distinction
>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and operators
>> that use
>>>>>>>>>>>>>>>>>>>>> user-created
>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might indicate
>> that
>>>>>>>>>>>>> leaving
>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
>> use-defined
>>>>>>>>>>>>> topics) and
>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
>> operator that
>>>>>>>>>>>>> manages
>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
>> one
>>>>>>>>>>> operator
>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
>> sometimes
>>>>>>>>>>> not; a
>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
>> clearer.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
>> wondering
>>>>>>>>>>>>> if the
>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
>> it is
>>>>>>>>>>>>> semantically
>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
>> others on
>>>>>>>>>>>>> this, too.
>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow to
>>>>>>>>>>> configure
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
>> assume that
>>>>>>>>>>>>> a topic
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
>> instead of
>>>>>>>>>>>>> a plain
>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other configs,
>> like
>>>>>>>>>>>>> replication
>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
>> need to
>>>>>>>>>>>>> change the
>>>>>>>>>>>>>>>>>>>>> "main
>>>>>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
>> them
>>>>>>>>>>> myself.
>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe enhancing
>> Produced
>>>>>>>>>>>>> with num
>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
>> Let me
>>>>>>>>>>>>> explain
>>>>>>>>>>>>>>>>>>> why:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
>> configuration,
>>>>>>>>>>>>> this will
>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
>> the final
>>>>>>>>>>>>> sink of
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable assumption
>> that user
>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
>> case, having
>>>>>>>>>>>>> num of
>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
>> contract, seems
>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
>> explicitly for
>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>>>>> (key
>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
>> are producer
>>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
>>>>>>>>>>>>> configuration. And
>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
>> configurations
>>>>>>>>>>>>> together in
>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>> class is the good approach.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
>> Produced
>>>>>>>>>>>>> parameter is
>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g topics
>> created
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> managed
>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
>> should leave
>>>>>>>>>>>>> it as
>>>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>>>>>> in that case.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
>> should
>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
>> create and
>>>>>>>>>>>>> manage
>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
>> should be
>>>>>>>>>>>>> created in
>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
>> configuration in
>>>>>>>>>>>>>>>>>>> Produced
>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
>> configuration in
>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
>> KStream#through - I
>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
>> possibility to
>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>>>>>> num
>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :) Considering
>> arguments
>>>>>>>>>>>>> against
>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
>> options
>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>> num of
>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will
>> be auto
>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>> specified num of
>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
>> new method
>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
>> in one of
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> threads)
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
>> following
>>>>>>>>>>> plan:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped
>> class (as
>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
>> KStream#through
>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
>> num of
>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>> will be taken from source topic
>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will
>> be auto
>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions
>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
>> Produced<K, V>
>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
>> specified num of
>>>>>>>>>>>>>>>>>>> partitions
>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped
>> class (as
>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
>> creating and
>>>>>>>>>>>>> managing
>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
>> discussed in the
>>>>>>>>>>>>> mailing
>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
>> about this
>>>>>>>>>>>>> KIP :(
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
>> KIP-221. Going
>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
>> around
>>>>>>>>>>>>> usefulness of
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
>> understood, the
>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
>> method
>>>>>>>>>>>>> (essentially
>>>>>>>>>>>>>>>>>>>>> making topic name optional)
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control
>>>>>>>>>>>>> parallelism and
>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
>>>>>>>>>>>>> KStream#through if
>>>>>>>>>>>>>>>>>>>>> this changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani


Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Posted by Sophie Blee-Goldman <so...@confluent.io>.
Hey Levani,

I think people are busy with the upcoming 2.4 release, and don't have much
spare time at the
moment. It's kind of a difficult time to get attention on things, but feel
free to pick up something else
to work on in the meantime until things have calmed down a bit!

Cheers,
Sophie


On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze <le...@gmail.com>
wrote:

> Hello all,
>
> Sorry for bringing this thread again, but I would like to get some
> attention on this PR: https://github.com/apache/kafka/pull/7170 <
> https://github.com/apache/kafka/pull/7170>
> It's been a while now and I would love to move on to other KIPs as well.
> Please let me know if you have any concerns.
>
> Regards,
> Levani
>
>
> > On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze <le...@gmail.com>
> wrote:
> >
> > Hi all,
> >
> > Here’s voting thread for this KIP:
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html <
> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>
> >
> > Regards,
> > Levani
> >
> >> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze <levani.codes@gmail.com
> <ma...@gmail.com>> wrote:
> >>
> >> Hi Matthias,
> >>
> >> Thanks for the suggestion. I Don’t have strong opinion on that one.
> >> Agree that avoiding unnecessary method overloads is a good idea.
> >>
> >> Updated KIP
> >>
> >> Regards,
> >> Levani
> >>
> >>
> >>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matthias@confluent.io
> <ma...@confluent.io>> wrote:
> >>>
> >>> One question:
> >>>
> >>> Why do we add
> >>>
> >>>> Repartitioned#with(final String name, final int numberOfPartitions)
> >>>
> >>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
> >>> combination with `withName()` and `withNumberOfPartitions()` should be
> >>> sufficient. Users can chain the method calls.
> >>>
> >>> (I think it's valuable to keep the number of overload small if
> possible.)
> >>>
> >>> Otherwise LGTM.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
> >>>> Hello,
> >>>>
> >>>> Thanks all for your feedback.
> >>>> I started voting procedure for this KIP. If there’re any other
> concerns about this KIP, please let me know.
> >>>>
> >>>> Regards,
> >>>> Levani
> >>>>
> >>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <
> levani.codes@gmail.com <ma...@gmail.com>> wrote:
> >>>>>
> >>>>> Hi Matthias,
> >>>>>
> >>>>> Thanks for the suggestion, makes sense.
> >>>>> I’ve updated KIP (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>).
> >>>>>
> >>>>> Regards,
> >>>>> Levani
> >>>>>
> >>>>>
> >>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <matthias@confluent.io
> <ma...@confluent.io> <mailto:matthias@confluent.io <mailto:
> matthias@confluent.io>>> wrote:
> >>>>>>
> >>>>>> Thanks for driving the KIP.
> >>>>>>
> >>>>>> I agree that users need to be able to specify a partitioning
> strategy.
> >>>>>>
> >>>>>> Sophie raises a fair point about topic configs and producer
> configs. My
> >>>>>> take is, that consider `Repartitioned` as an "extension" to
> `Produced`,
> >>>>>> that adds topic configuration, is a good way to think about it and
> helps
> >>>>>> to keep the API "clean".
> >>>>>>
> >>>>>>
> >>>>>> With regard to method names. I would prefer to avoid abbreviations.
> Can
> >>>>>> we rename:
> >>>>>>
> >>>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
> >>>>>>
> >>>>>> Furthermore, it might be good to add some more `static` methods:
> >>>>>>
> >>>>>> - Repartitioned.with(Serde<K>, Serde<V>)
> >>>>>> - Repartitioned.withNumberOfPartitions(int)
> >>>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
> >>>>>>> Totally agree. I think in KStream interface it makes sense to have
> some duplicate configurations between operators in order to keep API simple
> and usable.
> >>>>>>> Also, as more surface API has, harder it is to have proper
> backward compatibility.
> >>>>>>> While initial idea of keeping topic level configs separate was
> exciting, having Repartitioned class encapsulate some producer level
> configs makes API more readable.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Levani
> >>>>>>>
> >>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <
> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>
> >>>>>>>> I think that is a good point about trying to keep producer level
> >>>>>>>> configurations and (repartition) topic level considerations
> separate.
> >>>>>>>> Number of partitions is definitely purely a topic level
> configuration. But
> >>>>>>>> on some level, serdes and partitioners are just as much a topic
> >>>>>>>> configuration as a producer one. You could have two producers
> configured
> >>>>>>>> with different serdes and/or partitioners, but if they are
> writing to the
> >>>>>>>> same topic the result would be very difficult to part. So in a
> sense, these
> >>>>>>>> are configurations of topics in Streams, not just producers.
> >>>>>>>>
> >>>>>>>> Another way to think of it: while the Streams API is not always
> true to
> >>>>>>>> this, ideally all the relevant configs for an operator are
> wrapped into a
> >>>>>>>> single object (in this case, Repartitioned). We could instead
> split out the
> >>>>>>>> fields in common with Produced into a separate parameter to keep
> topic and
> >>>>>>>> producer level configurations separate, but this increases the
> API surface
> >>>>>>>> area by a lot. It's much more straightforward to just say "this is
> >>>>>>>> everything that this particular operator needs" without worrying
> about what
> >>>>>>>> exactly you're specifying.
> >>>>>>>>
> >>>>>>>> I suppose you could alternatively make Produced a field of
> Repartitioned,
> >>>>>>>> but I don't think we do this kind of composition elsewhere in
> Streams at
> >>>>>>>> the moment
> >>>>>>>>
> >>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze <
> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Bill,
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for the feedback.
> >>>>>>>>> Yes, that makes sense. I’ve updated KIP with
> `Repartitioned#partitioner`
> >>>>>>>>> configuration.
> >>>>>>>>> In the beginning, I wanted to introduce a class for topic level
> >>>>>>>>> configuration and keep topic level and producer level
> configurations (such
> >>>>>>>>> as Produced) separately (see my second email in this thread).
> >>>>>>>>> But while looking at the semantics of KStream interface, I
> couldn’t really
> >>>>>>>>> figure out good operation name for Topic level configuration
> class and just
> >>>>>>>>> introducing `Topic` config class was kinda breaking the
> semantics.
> >>>>>>>>> So I think having Repartitioned class which encapsulates topic
> and
> >>>>>>>>> producer level configurations for internal topics is viable
> thing to do.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Levani
> >>>>>>>>>
> >>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbejeck@gmail.com
> <ma...@gmail.com> <mailto:bbejeck@gmail.com <mailto:
> bbejeck@gmail.com>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Lavani,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for resurrecting this KIP.
> >>>>>>>>>>
> >>>>>>>>>> I'm also a +1 for adding a partition option.  In addition to
> the reason
> >>>>>>>>>> provided by John, my reasoning is:
> >>>>>>>>>>
> >>>>>>>>>> 1. Users may want to use something other than hash-based
> partitioning
> >>>>>>>>>> 2. Users may wish to partition on something different than the
> key
> >>>>>>>>>> without having to change the key.  For example:
> >>>>>>>>>>  1. A combination of fields in the value in conjunction with
> the key
> >>>>>>>>>>  2. Something other than the key
> >>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence in
> >>>>>>>>>> KStream.to and KStream.through, so it makes sense for API
> consistency.
> >>>>>>>>>>
> >>>>>>>>>> Just my  2 cents.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Bill
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
> >>>>>>>>> levani.codes@gmail.com <ma...@gmail.com> <mailto:
> levani.codes@gmail.com <ma...@gmail.com>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi John,
> >>>>>>>>>>>
> >>>>>>>>>>> In my mind it makes sense.
> >>>>>>>>>>> If we add partitioner configuration to Repartitioned class,
> with the
> >>>>>>>>>>> combination of specifying number of partitions for internal
> topics, user
> >>>>>>>>>>> will have opportunity to ensure co-partitioning before join
> operation.
> >>>>>>>>>>> I think this can be quite powerful feature.
> >>>>>>>>>>> Wondering what others think about this?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <john@confluent.io
> <ma...@confluent.io> <mailto:john@confluent.io <mailto:
> john@confluent.io>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally
> sure it
> >>>>>>>>>>>> makes sense, but I believe something similar is the rationale
> for
> >>>>>>>>>>>> having the partitioner option in Produced.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> -John
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
> >>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hey John,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Oh that’s interesting use-case.
> >>>>>>>>>>>>> Do I understand this correctly, in your example I would
> first issue
> >>>>>>>>>>> repartition(Repartitioned) with proper partitioner that
> essentially
> >>>>>>>>> would
> >>>>>>>>>>> be the same as the topic I want to join with and then do the
> >>>>>>>>> KStream#join
> >>>>>>>>>>> with DSL?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <
> john@confluent.io <ma...@confluent.io> <mailto:john@confluent.io
> <ma...@confluent.io>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hey, all, just to chime in,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think it might be useful to have an option to specify the
> >>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may
> get
> >>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the
> right-side
> >>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
> >>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the
> same
> >>>>>>>>>>>>>> strategy.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Does that make sense, or did I maybe miss something
> important?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
> >>>>>>>>>>>>>> <levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not
> sure
> >>>>>>>>> about
> >>>>>>>>>>> it yet.
> >>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would
> need control
> >>>>>>>>>>> over partitioner for internal topics.
> >>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best
> how to
> >>>>>>>>>>> partition data for internal topics.
> >>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for
> topics
> >>>>>>>>> that
> >>>>>>>>>>> are created by user In advance.
> >>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to
> specify
> >>>>>>>>> the
> >>>>>>>>>>> partitioner.
> >>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess
> specifying the
> >>>>>>>>>>> partitioner can be added as well if there’s agreement on this.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
> >>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned
> would be a
> >>>>>>>>>>> useful
> >>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
> >>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to
> Produced? I'm not
> >>>>>>>>>>> sure how
> >>>>>>>>>>>>>>>> widely this feature is really used, but seems it should be
> >>>>>>>>> available
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>>> repartition topics.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
> >>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hey Sophie,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In both cases KStream#repartition and
> >>>>>>>>>>> KStream#repartition(Repartitioned)
> >>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams.
> >>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over
> the topic
> >>>>>>>>>>> such as
> >>>>>>>>>>>>>>>>> num of partitions.
> >>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that is
> missing
> >>>>>>>>> in
> >>>>>>>>>>>>>>>>> current DSL design.
> >>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by
> configuring
> >>>>>>>>> num
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> partitions for internal topics.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hope this answers your question.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
> >>>>>>>>>>> sophie@confluent.io <ma...@confluent.io> <mailto:
> sophie@confluent.io <ma...@confluent.io>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hey Levani,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me --
> for the
> >>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned,
> will the
> >>>>>>>>>>> topic be
> >>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case for
> the
> >>>>>>>>>>> signature
> >>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be
> pre-created? The
> >>>>>>>>>>> wording
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method
> will
> >>>>>>>>>>> auto-create
> >>>>>>>>>>>>>>>>>> topics while the other will not.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>> Sophie
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> >>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>
> <mailto:levani.codes@gmail.com <ma...@gmail.com>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> One more bump about KIP-221 (
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint>
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>
> >>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >
> >>>>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :)
> >>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns about
> this KIP.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
> >>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated
> confluence
> >>>>>>>>> page
> >>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>> the new proposal
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> King regards,
> >>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>> levani.codes@gmail.com
> >>>>>>>>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hello Matthias,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic`
> class for
> >>>>>>>>>>> topic
> >>>>>>>>>>>>>>>>>>> configuration for internal operators like `groupedBy`.
> >>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this as
> well.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
> >>>>>>>>>>> matthias@confluent.io
> >>>>>>>>>>>>>>>>>>> <ma...@confluent.io>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Levani,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for
> summarizing
> >>>>>>>>>>>>>>>>> everything.
> >>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed
> already (can't
> >>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to
> refresh the
> >>>>>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to
> the
> >>>>>>>>>>> distinction
> >>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and operators
> that use
> >>>>>>>>>>>>>>>>>>> user-created
> >>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might indicate
> that
> >>>>>>>>>>> leaving
> >>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses
> use-defined
> >>>>>>>>>>> topics) and
> >>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an
> operator that
> >>>>>>>>>>> manages
> >>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is
> one
> >>>>>>>>> operator
> >>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but
> sometimes
> >>>>>>>>> not; a
> >>>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction
> clearer.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am
> wondering
> >>>>>>>>>>> if the
> >>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding
> it is
> >>>>>>>>>>> semantically
> >>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of
> others on
> >>>>>>>>>>> this, too.
> >>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow to
> >>>>>>>>> configure
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we
> assume that
> >>>>>>>>>>> a topic
> >>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a
> >>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
> instead of
> >>>>>>>>>>> a plain
> >>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other configs,
> like
> >>>>>>>>>>> replication
> >>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the
> need to
> >>>>>>>>>>> change the
> >>>>>>>>>>>>>>>>>>> "main
> >>>>>>>>>>>>>>>>>>>>>>> API".
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like
> them
> >>>>>>>>> myself.
> >>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe enhancing
> Produced
> >>>>>>>>>>> with num
> >>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach.
> Let me
> >>>>>>>>>>> explain
> >>>>>>>>>>>>>>>>> why:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this
> configuration,
> >>>>>>>>>>> this will
> >>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is
> the final
> >>>>>>>>>>> sink of
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable assumption
> that user
> >>>>>>>>>>> needs
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that
> case, having
> >>>>>>>>>>> num of
> >>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API
> contract, seems
> >>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>> Produced is designed to be something that is
> explicitly for
> >>>>>>>>>>> producer
> >>>>>>>>>>>>>>>>> (key
> >>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all
> are producer
> >>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level
> >>>>>>>>>>> configuration. And
> >>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level
> configurations
> >>>>>>>>>>> together in
> >>>>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>>>> class is the good approach.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like
> Produced
> >>>>>>>>>>> parameter is
> >>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g topics
> created
> >>>>>>>>> and
> >>>>>>>>>>>>>>>>> managed
> >>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we
> should leave
> >>>>>>>>>>> it as
> >>>>>>>>>>>>>>>>> it is
> >>>>>>>>>>>>>>>>>>> in that case.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we
> should
> >>>>>>>>>>> distinguish
> >>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should
> create and
> >>>>>>>>>>> manage
> >>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that
> should be
> >>>>>>>>>>> created in
> >>>>>>>>>>>>>>>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions
> configuration in
> >>>>>>>>>>>>>>>>> Produced
> >>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level
> configuration in
> >>>>>>>>>>> one
> >>>>>>>>>>>>>>>>> class
> >>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in
> KStream#through - I
> >>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users
> possibility to
> >>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>> num
> >>>>>>>>>>>>>>>>>>> of partitions there is even more useful :) Considering
> arguments
> >>>>>>>>>>> against
> >>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two
> options
> >>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> num of
> >>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will
> be auto
> >>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> Produced<K, V>
> >>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> specified num of
> >>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce
> new method
> >>>>>>>>> -
> >>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this
> in one of
> >>>>>>>>> the
> >>>>>>>>>>>>>>>>> threads)
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the
> following
> >>>>>>>>> plan:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Option A:
> >>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped
> class (as
> >>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to
> KStream#through
> >>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and
> num of
> >>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>> will be taken from source topic
> >>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will
> be auto
> >>>>>>>>>>>>>>>>>>> generated with specified num of partitions
> >>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final
> Produced<K, V>
> >>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with
> specified num of
> >>>>>>>>>>>>>>>>> partitions
> >>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Option B:
> >>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to Grouped
> class (as
> >>>>>>>>>>>>>>>>>>> mentioned in the KIP)
> >>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for
> creating and
> >>>>>>>>>>> managing
> >>>>>>>>>>>>>>>>>>> internal repartition topics
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already
> discussed in the
> >>>>>>>>>>> mailing
> >>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were
> about this
> >>>>>>>>>>> KIP :(
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
> >>>>>>>>>>>>>>>>>>> levani.codes@gmail.com <ma...@gmail.com>>
> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around
> KIP-221. Going
> >>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement
> around
> >>>>>>>>>>> usefulness of
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I
> understood, the
> >>>>>>>>>>> most
> >>>>>>>>>>>>>>>>>>> optimal solution for me seems the following:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through
> method
> >>>>>>>>>>> (essentially
> >>>>>>>>>>>>>>>>>>> making topic name optional)
> >>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
> >>>>>>>>> configuration
> >>>>>>>>>>>>>>>>> field.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control
> >>>>>>>>>>> parallelism and
> >>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around
> >>>>>>>>>>> KStream#through if
> >>>>>>>>>>>>>>>>>>> this changes sound sensible.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>>>>>>>>>>>> Levani
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>