Posted to users@kafka.apache.org by Jan Bols <ja...@telenet.be> on 2020/10/26 20:38:55 UTC

Partitioning per team

For a kafka-streams application, we keep data per team. Data from two teams
never meets, but within a team the data is highly integrated. A team has team
members but also has several types of equipment.
A team has a lifespan of about 1-3 days after which the team is removed and
all data relating to that team should be evicted.

How would you partition the data?
- Using the team id as key for all streams seems not ideal b/c this means
all aggregations need to happen per team, involving a ser/deser of the
entire team data. Suppose there are 10 team members and only 1 team member is
sending events that need to be aggregated. In this case, we still need a
ser/deser of the entire aggregated team data. I'm afraid this would result
in quite a bit of overhead.
- Using the user id or equipment id as key would result in much smaller
aggregations but does mean quite a bit of repartitioning when aggregating
and joining users of the same team.

I ended up using the second approach, but I wonder if that was really a
good idea b/c the entire streaming logic does become quite involved.

What is your experience with this type of data?

Best regards
Jan

Re: Partitioning per team

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Jan,

One alternative approach you can consider is to use a combined <team, user>
key, which keeps the aggregations small, while customizing the partitioner
for the repartition topic so that keys with the same <team> prefix always
go to the same partition. Then, when cleaning up data, you can similarly do
a range scan on the <team> prefix within the store and delete all
<team, user> entries when the team is removed.

Guozhang
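
The approach in the reply above could be sketched roughly as follows. This
is an illustration only, not code from the thread: the class and method
names (TeamPartitioning, Key, partitionFor, evictTeam) are hypothetical, a
plain hash stands in for whatever hash a real partitioner would use, and a
TreeMap stands in for a sorted state store. In Kafka Streams the partition
logic would presumably live in a StreamPartitioner implementation and the
eviction would use the store's range() and delete() calls, with the caveat
that a persistent store orders keys by their serialized bytes, so the
serializer would have to write the team id first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TeamPartitioning {

    /** Hypothetical composite key: team id first, then user or equipment id. */
    static final class Key implements Comparable<Key> {
        final String teamId;
        final String memberId;

        Key(String teamId, String memberId) {
            this.teamId = teamId;
            this.memberId = memberId;
        }

        // Order by team first so that all keys of one team are contiguous.
        @Override
        public int compareTo(Key other) {
            int byTeam = teamId.compareTo(other.teamId);
            return byTeam != 0 ? byTeam : memberId.compareTo(other.memberId);
        }
    }

    /** Chooses a partition from the team id alone, ignoring the member id. */
    static int partitionFor(Key key, int numPartitions) {
        byte[] teamBytes = key.teamId.getBytes(StandardCharsets.UTF_8);
        // Mask the sign bit so the result is never negative.
        return (Arrays.hashCode(teamBytes) & 0x7fffffff) % numPartitions;
    }

    /** Removes every entry of one team with a single range over the sorted store. */
    static int evictTeam(NavigableMap<Key, Long> store, String teamId) {
        Key from = new Key(teamId, "");        // smallest possible key of this team
        Key to = new Key(teamId + '\0', "");   // smallest possible key after this team
        NavigableMap<Key, Long> range = store.subMap(from, true, to, false);
        int removed = range.size();
        range.clear(); // clearing the view deletes the entries from the store
        return removed;
    }

    public static void main(String[] args) {
        NavigableMap<Key, Long> store = new TreeMap<>();
        store.put(new Key("alpha", "alice"), 3L);
        store.put(new Key("alpha", "bob"), 5L);
        store.put(new Key("beta", "carol"), 7L);

        // Both members of team "alpha" map to the same partition.
        System.out.println(partitionFor(new Key("alpha", "alice"), 8)
                == partitionFor(new Key("alpha", "bob"), 8));

        // Evicting team "alpha" leaves only the entries of other teams.
        System.out.println(evictTeam(store, "alpha") + " removed, "
                + store.size() + " left");
    }
}
```

Each aggregate stays as small as a single user or piece of equipment, while
everything belonging to one team still lands on the same partition, so
team-level joins should not need a further repartition step.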



