Posted to users@kafka.apache.org by Kishore Senji <ks...@gmail.com> on 2016/03/18 00:28:54 UTC

Kafka Streams scaling questions

Hi All,

I read through the doc on KStreams here:
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple

I was wondering how a use case that I have could be solved with KStreams.

Use-case: Logs are obtained from a service pool that contains many nodes. We
want to alert if a particular consumer (identified by consumer id) makes more
than X calls to the service in the last 1 minute. The data is available in
logs similar to access logs. The window is a sliding window: we have to look
back 60s from the current event and check whether, in total (including the
current event), the count exceeds the threshold. Logs are pushed to Kafka
using a random partitioner whose range is [1 to n], where n is the total
number of partitions.

One way of achieving this is to push data into the first Kafka topic (using
random partitioning) and then have a set of KStream tasks re-shuffle the data
by consumer_id into a second topic. The next set of KStream tasks operates on
the second topic (1 task per partition) and does the aggregation (a rough
sketch of this pipeline follows the list below). If this is an acceptable
solution, here are my questions on scaling.


   - I can see that the second topic is prone to hotspots. If we get
   billions of requests for a given consumer_id and only a few hundred for
   another consumer_id, some partitions of the second Kafka topic will become
   hotspots (and a partition receiving a large volume of logs can starve other
   partitions on the same broker). If we try to create more partitions and
   isolate the partition receiving the large volume, this wastes resources.
   - The max parallelism that we can get for a KStream task is the number
   of partitions - this may work for a single stream. How would this work for
   multi-tenant stream processing, where people want to write multiple stream
   jobs on the same set of data? If the parallelism is not enough, they would
   have to copy and re-group the data into another topic with more partitions.
   I think we need two knobs: one for scaling Kafka (the number of partitions)
   and one for scaling the stream processing. It sounds like with KStreams
   there is only one knob for both.
   - How would we deal with organic growth of data? Let us say the number of
   partitions we chose for the second topic (where the data is grouped by
   consumer_id) is not enough to deal with organic growth in volume. If we
   increase the partitions, for a given consumer some data could be in one
   partition before the flex up and end up in a different partition after the
   flex up. Since the KStream tasks are unique per partition and share no
   state across partitions, the aggregated result would be incorrect, unless
   we have only one task read all the data, in which case it becomes a
   bottleneck.
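
To make the two-topic approach above concrete, here is a rough sketch of the
pipeline I have in mind (the topic name, threshold and log parsing are made up,
and the DSL method names are approximate and vary by Streams version - older
DSLs would use .through("second-topic") instead of repartition()):

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SlidingWindows;

public class ConsumerRateAlert {

    static final long THRESHOLD = 10_000L;   // X calls per minute (made up)

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "consumer-rate-alert");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // first topic: randomly partitioned access-log lines
        KStream<String, String> logs = builder.stream("access-logs");

        logs.selectKey((ignoredKey, line) -> extractConsumerId(line))   // re-key by consumer_id
            .repartition()                                              // the "second topic" (an internal repartition topic)
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(SlidingWindows.ofTimeDifferenceAndGrace(Duration.ofMinutes(1), Duration.ofSeconds(30)))
            .count()                                                    // calls per consumer_id in the last 60s
            .toStream()
            .filter((windowedConsumer, count) -> count != null && count > THRESHOLD)
            .foreach((windowedConsumer, count) ->
                alert(windowedConsumer.key(), count));

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder: pull the consumer id out of an access-log line.
    static String extractConsumerId(String logLine) {
        return logLine.split(" ")[0];
    }

    // Placeholder: hook the real alerting system in here.
    static void alert(String consumerId, long count) {
        System.out.printf("ALERT: consumer %s made %d calls in the last minute%n", consumerId, count);
    }
}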

In Storm (or any other streaming engine), the way to solve this would be to
have only one topic (partitioned n ways) with data pushed in using a random
partitioner (so no hotspots or scaling issues). We would have n spouts reading
data from those partitions and then m bolts receiving the data via fields
grouping on consumer_id. Since all the data for a given consumer_id ends up in
the same bolt, we can do the sliding window and the alerting there.
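
For illustration, the Storm wiring I mean looks roughly like this (component
names are made up, the spout is an in-memory stand-in for a Kafka spout reading
the randomly partitioned topic, and the bolt keeps a trivially simplified count
- a real bolt would keep a proper 60-second sliding window, e.g. with Storm's
windowed bolt support):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ConsumerRateTopology {

    // Stand-in for a Kafka spout: emits tuples with a "consumer_id" field.
    public static class AccessLogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("consumer-" + random.nextInt(100)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("consumer_id"));
        }
    }

    // Counts calls per consumer and alerts over a threshold (no real windowing here).
    public static class RateBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String consumerId = tuple.getStringByField("consumer_id");
            long count = counts.merge(consumerId, 1L, Long::sum);
            if (count > 10_000L) {
                System.out.printf("ALERT: %s exceeded the threshold (%d calls)%n", consumerId, count);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // n spout executors, one per Kafka partition in the real setup
        builder.setSpout("logSpout", new AccessLogSpout(), 4);
        // m bolt executors (m can be > n); fieldsGrouping routes every tuple for a
        // given consumer_id to the same bolt instance, whatever partition it came from
        builder.setBolt("rateBolt", new RateBolt(), 8)
               .fieldsGrouping("logSpout", new Fields("consumer_id"));

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("consumer-rate-alert", new Config(), builder.createTopology());
            Thread.sleep(60_000);
        }
    }
}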

If we solved it the Storm way in KStreams, we would have only one topic
(partitioned n ways) with data pushed in using a random partitioner (so no
hotspots or scaling issues). But then we could only have one KStream task
reading all the data and doing the windowing and aggregation, and that would
become a bottleneck for scaling.

So it sounds like KStreams will either have hotspots in the Kafka topic (as
each partition needs to hold all the data its KStream task needs so the task
can work independently) or scaling issues in the KStream task doing the
aggregation.

How would one solve this kind of problem with KStreams?



<https://lh3.googleusercontent.com/-uxCWSNe6nw8/Vusvqd5nLAI/AAAAAAAAD5s/aqY-DexRkn097M9egCLLb3D_ANyCKm60w/s1600/Screen%2BShot%2B2016-03-17%2Bat%2B3.30.17%2BPM.png>

Re: Kafka Streams scaling questions

Posted by Kishore Senji <ks...@gmail.com>.
Hi Ben,

Thank you for the reply. Let us say we are processing Apache access logs (or
something similar) for a service that is served by n nodes. We want to compute
1) errors per consumer (identified by client IP or something in the headers)
and 2) errors per URL.

These are different group-bys that we need to do on top of the same data. The
way this can be done in Storm (or any stream processing engine where group-by
is supported at runtime) is to push the data into Kafka in a round-robin
fashion and, in the stream processing layer, use a custom group-by on the
required dimension. So we would have two topologies, where one does fields
grouping on consumer and the other does fields grouping on URL. One bolt gets
the data for a given key for that time window, looks at each log line, checks
for errors, and increments the error count for the appropriate group (we can
have n partitions and m bolts, where m > n). At the end of the tumbling (or
sliding) window, it emits the count of errors per dimension (consumer or URL).

In this approach there are no issues with Kafka scaling (the load is evenly
distributed, and if we had to flex up we could increase the partitions to
handle organic growth). Scaling the Storm topology is also straightforward: we
can increase parallelism as needed (we can choose more tasks up front with
lower parallelism and later use that headroom to increase parallelism, or we
might restart the topology with appropriate parallelism hints). This all
assumes a single bolt can handle the data for a given consumer for that time
window (i.e. the capacity is < 1 and there is no offset lag). Here we can have
just one source topic, with two runtime streams reading the data and
processing it in two different ways. We can persist the window data and the
offset (or persist only the offset that starts the window, in the case of
tumbling windows), re-create the appropriate window, and the aggregation will
be accurate. If there are alerts set up on these counts, we won't miss any.

In KStreams we would need two source topics, partitioned by those group-by
keys (yes, we can re-group in the DSL, but that creates an internal topic).
Because we have to group by consumer up front, it can happen that the k
consumers that fall into one partition generate a lot of logs, so even though
we have enough capacity overall in the cluster, that one partition becomes
hot. On the streams side, since a partition can only be read by one stream
task, one task ends up reading all of that partition's data (all k consumers'
data), and since the combined volume of those k consumers is much higher, that
one task also becomes a bottleneck.

Let us say we try to increase the number of partitions to split the hot
partition and spread the load; this now starts pushing data for a given
consumer into a different partition. The stream tasks would need to be scaled
up as well, but the old task handling that consumer's data now has to make
sure it also reads the data from the other partition, otherwise the
aggregations for that particular window will be incorrect. It is very
difficult to convey these mapping changes to the stream tasks and still get
accurate results, so on a flex up one would have to drop the windows and
restart.
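
As a sketch only (DSL method names vary by version and the parsing helpers are
placeholders), the two re-groupings over one source topic would look something
like this; each groupBy pushes the data through its own internal repartition
topic keyed by that dimension, which is exactly where the hot-partition concern
above comes from:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class ErrorCounts {

    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("access-logs");        // round-robin partitioned
        KStream<String, String> errors = logs.filter((k, line) -> isError(line));

        // errors per consumer: re-keyed by client ip -> repartition topic #1
        KTable<Windowed<String>, Long> errorsPerConsumer = errors
            .groupBy((k, line) -> clientIp(line), Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count();

        // errors per URL: a second re-keying of the same stream -> repartition topic #2
        KTable<Windowed<String>, Long> errorsPerUrl = errors
            .groupBy((k, line) -> url(line), Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count();

        // errorsPerConsumer / errorsPerUrl would feed the alerting or output topics
        return builder;
    }

    public static void main(String[] args) {
        // prints the topology, including the two internal repartition topics
        System.out.println(topology().build().describe());
    }

    // Placeholder access-log parsing
    static boolean isError(String line)  { return line.contains(" 500 ") || line.contains(" 503 "); }
    static String  clientIp(String line) { return line.split(" ")[0]; }
    static String  url(String line)      { return line.split(" ")[6]; }
}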

I know that from a Kafka perspective the recommendation is to make sure the
key chosen for semantic partitioning also spreads the load. For this use case,
consumer may not be a good semantic partitioning key, but since we still want
to achieve the use case, would that mean KStreams is not a good fit for cases
where the semantic partitioning key does not guarantee load distribution? Even
URL is not a good key, because some URLs get the bulk of the traffic, and the
upper limit on partitions would be the number of distinct URLs.

For these log aggregation use cases, relying on Kafka's partitioning scheme
for streaming will create issues like the above if we want stateful processing
(such as windowed aggregation). If it is purely stateless processing then it
may be fine (and in that case round-robin works in Kafka). The only thing we
lose with the round-robin approach is the ordering guarantee, and if that is
not an issue, I think it is more flexible.

Thanks,
Kishore.




Re: Kafka Streams scaling questions

Posted by Ben Stopford <be...@confluent.io>.
Hi Kishore

In general I think it’s up to you to choose keys that keep related data together, but also give you reasonable load balancing. I’m afraid I’m not sure I fully followed your explanation of how Storm solves this problem more efficiently, though.

I noticed you asked: “How would this work for a multi-tenant stream processing where people want to write multiple stream jobs on the same set of data?” - I think this is simply the Consumer Group behaviour. Different applications would get different consumer groups (application ids in KStreams), giving them independent parallelism over the same data.
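
For example (application ids and topic name made up), two jobs over the same
topic would each configure their own application.id; under the hood each one is
its own consumer group with its own offsets, so both see all the data and each
scales independently up to the partition count:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class TwoJobsSameTopic {

    static KafkaStreams startJob(String applicationId) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);   // one consumer group per id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("access-logs").foreach((k, v) -> { /* job-specific processing */ });

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        return streams;
    }

    public static void main(String[] args) {
        startJob("errors-by-consumer");   // job 1
        startJob("errors-by-url");        // job 2: reads the same topic independently
    }
}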

In theory there is another knob to consider. Consumers (actually the leader consumer) can control which partitions they get assigned. KStreams already uses this feature to do things like create standby replicas, but I don’t think (though I may be wrong) this helps you with your problem directly.

All the best

B 


Re: Kafka Streams scaling questions

Posted by Kishore Senji <ks...@gmail.com>.
I will scale back the question to get some replies :)

Suppose the use case is to build a monitoring platform. For log aggregation
from thousands of nodes, I believe that a Kafka topic should be partitioned n
ways and the data should be sprayed in a round-robin fashion to get a good,
even distribution of data (because we don't know up front how the data will be
sliced semantically, and we don't know whether the key chosen for semantic
partitioning gives an even distribution of data). Later, in stream processing,
the appropriate group-bys would be done on the same source of data to support
the various ways of slicing.


http://kafka.apache.org/documentation.html#design_loadbalancing - "This can
be done at random, implementing a kind of random load balancing, or it can
be done by some semantic partitioning function"
http://kafka.apache.org/documentation.html#basic_ops_modify_topic - "Be
aware that one use case for partitions is to semantically partition data,
and adding partitions doesn't change the partitioning of existing data so
this may disturb consumers if they rely on that partition"

The above docs caution against semantic partitioning, since it can lead to
uneven distribution (hotspots) if the semantic key does not distribute evenly,
and on a flex up of partitions the data for a key would then live in two
partitions. For these reasons, I strongly believe the data should be pushed to
Kafka in a round-robin fashion and a stream processing framework should later
apply the appropriate group-bys (this also gives us the flexibility to slice
in different ways at runtime).
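
To be concrete about what I mean by round-robin: produce with no key, so the
default partitioner spreads the records across all partitions instead of
hashing a semantic key (topic name and log line are made up; older clients
spread null-keyed records round-robin, newer ones batch them stickily per
partition, but either way the load stays balanced):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RoundRobinLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String logLine = "10.1.2.3 - - [18/Mar/2016:00:28:54] \"GET /api/orders HTTP/1.1\" 200";
            // No key: the partition is chosen by the default partitioner rather than
            // by hashing a semantic key, so no single partition becomes hot.
            producer.send(new ProducerRecord<>("access-logs", logLine));
        }
    }
}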

KStreams lets us do stream processing on a partition of data. So to do
windowed aggregation, the data for the same key has to be in the same
partition. This means that to use KStreams we have to use semantic
partitioning, which has the issues called out in the Kafka docs above. So my
question is:

If we want to use KStreams, how should we deal with "load balancing" (semantic
partitioning can overload a single partition, so both the Kafka partition and
the KStream task reading it become overloaded) and "flex up of partitions"
(more than one partition ends up holding data for a given key, so the windowed
aggregations produce incorrect results)?

Thanks,
Kishore.
