Posted to users@kafka.apache.org by Michal Michalski <mi...@zalando.ie> on 2017/10/02 13:29:11 UTC

The idea of "composite key" to make log compaction more flexible - question / proposal

Hi,

TL;DR: I'd love to be able to make log compaction more "granular" than just
per-partition-key, so I was thinking about the concept of a "composite
key", where partitioning logic is using one part of the key, while
compaction uses the whole key - is this something desirable / doable /
worth a KIP?

Longer story / use case:

I'm currently a member of a team working on a project that's using a bunch
of applications to ingest data to the system (one "entity type" per app).
Once ingested by each application, since the entities are referring to each
other, they're all published to a single topic to ensure ordering for later
processing stages. Because of the nature of the data, for a given set of
entities related together, there's always a single "master" / "parent"
entity, whose ID we're using as the partition key; to give an example:
let's say you have a "product" entity which can have things like "media",
"reviews", "stocks" etc. associated with it - the product ID will be the
partition key for *all* these entities. However, with this approach we
simply cannot use log compaction because having e.g. "product", "media" and
"review" events, all with the same partition key "X", means that the
compaction process will at some point delete all but the latest of them,
causing data loss - only a single entity with key "X" will remain (and that's
absolutely correct - Kafka doesn't "understand" what the message contains).

We were thinking about introducing something we internally called
"composite key". The idea is to have a key that's not just a single String
K, but a pair of Strings: (K1, K2). For specifying the partition that the
message should be sent to, K1 would be used; however, for log compaction
purposes, the whole (K1, K2) would be used instead. This way, referring to
the example above, different entities "belonging" to the same "master
entity" (product), could be published to that topic with composite keys:
(productId, "product"), (productId, "media") and (productId, "review"), so
they all end up in a single partition (specified by K1, which is always:
productId), but they won't get compacted together, because the K2 part is
different for them, making the whole "composite key" (K1, K2) different. Of
course K2 would be optional, so for someone who only needs the default
behaviour nothing would change.
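
To make the proposal concrete, here is a minimal sketch of the intended
semantics. Everything in it is hypothetical: CompositeKey, Message,
partitionFor and compact are illustration-only names that do not exist in
Kafka, the hash is a simple stand-in rather than the one Kafka's default
partitioner uses, and "compaction" is simulated with an in-memory map.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only: none of these types exist in Kafka.
final class CompositeKeySketch {

    /** A (K1, K2) pair; K2 may be null, which gives today's single-key behaviour. */
    static final class CompositeKey {
        final String k1;
        final String k2;

        CompositeKey(String k1, String k2) {
            this.k1 = Objects.requireNonNull(k1);
            this.k2 = k2;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CompositeKey)) return false;
            CompositeKey other = (CompositeKey) o;
            return k1.equals(other.k1) && Objects.equals(k2, other.k2);
        }

        @Override
        public int hashCode() {
            return Objects.hash(k1, k2);
        }

        @Override
        public String toString() {
            return "(" + k1 + ", " + k2 + ")";
        }
    }

    /** A keyed message; the value is just a string in this sketch. */
    static final class Message {
        final CompositeKey key;
        final String value;

        Message(CompositeKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Partitioning looks at K1 only. The hash is a stand-in, not Kafka's murmur2. */
    static int partitionFor(CompositeKey key, int numPartitions) {
        int hash = Arrays.hashCode(key.k1.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions;
    }

    /** Simulated compaction of one partition: the latest value per full (K1, K2) key survives. */
    static Map<CompositeKey, String> compact(List<Message> log) {
        Map<CompositeKey, String> latest = new LinkedHashMap<>();
        for (Message m : log) {
            latest.put(m.key, m.value); // a later message overwrites an earlier one with the same key
        }
        return latest;
    }

    public static void main(String[] args) {
        CompositeKey product = new CompositeKey("42", "product");
        CompositeKey media = new CompositeKey("42", "media");
        List<Message> log = Arrays.asList(
                new Message(product, "product v1"),
                new Message(media, "media v1"),
                new Message(product, "product v2"));
        // Same K1, so both keys map to the same partition.
        System.out.println(partitionFor(product, 12) == partitionFor(media, 12));
        // The older product message is dropped; the media message is kept.
        System.out.println(compact(log));
    }
}
```

With a plain productId key, the same three messages would collapse into a
single survivor, which is exactly the data loss described above.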

Since I'm not a Kafka developer and I don't know its internals that well, I
can't say if this idea is technically feasible or not, but I'd think it is
- I'd be more afraid of the complexity around backwards compatibility etc.
and potential performance implications of such a change.

I know that similar behaviour is achievable by using the producer API that
allows explicitly specifying the partition ID (and the key), but I think
it's a bit "clunky" (for each message, generate a key that this message
should normally be using [productId] and somehow "map" that key into a
partition X; then send that message to this partition X, *but* use the
"compaction" key instead [productId, entity type] as the message key) and
it's something that could be abstracted away from the user.

Thoughts?

Question to Kafka users: Is this something that anyone here would find
useful? Is anyone here dealing with a similar problem?

Question to Kafka maintainers: Is this something that you could potentially
consider a useful feature? Would it be worth a KIP? Is something like this
(technically) doable at all?

--
Kind regards,
Michał Michalski

Re: The idea of "composite key" to make log compaction more flexible - question / proposal

Posted by Michal Michalski <mi...@zalando.ie>.
Hey Jay,

Thanks for the reply. Yes, this should do the job.

We were thinking about something that abstracts this logic away from the
user (e.g. the same way Cassandra handles its PK definitions in CQL -
"hiding" the row key and optional clustering key behind the concept of
"primary key"), but introducing such design obviously has some pros/cons
and non-trivial implications, so if using the partitioner interface is the
way to go in Kafka - we'll use it :-)

Thanks,
Michał


On 5 October 2017 at 15:22, Jay Kreps <ja...@confluent.io> wrote:

> I think you can do this now by using a custom partitioner, no?
>
> https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/producer/Partitioner.html
>
> -Jay

Re: The idea of "composite key" to make log compaction more flexible - question / proposal

Posted by Jay Kreps <ja...@confluent.io>.
I think you can do this now by using a custom partitioner, no?

https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/producer/Partitioner.html

-Jay
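
A minimal sketch of the partition-selection logic such a custom partitioner
could use, written as plain Java so it runs without the Kafka client on the
classpath. The assumptions are: the message key is a single string of the
form "productId|entityType" (the '|' delimiter and the PrefixPartitioning
class are made up for this example), and the hash is a simple stand-in. In a
real producer this logic would live in the partition(...) method of a class
implementing org.apache.kafka.clients.producer.Partitioner, registered
through the producer's partitioner.class setting.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; in a real Partitioner implementation, partition(...) would
// call partitionFor((String) key, cluster.partitionsForTopic(topic).size()).
final class PrefixPartitioning {

    // Assumed delimiter between the partitioning part and the rest of the key.
    static final char DELIMITER = '|';

    /** Returns the part of the key that decides the partition, e.g. "42" for "42|media". */
    static String partitioningPart(String key) {
        int idx = key.indexOf(DELIMITER);
        return idx < 0 ? key : key.substring(0, idx);
    }

    /**
     * Chooses a partition from the key prefix only. The hash is a stand-in: Kafka's
     * default partitioner uses murmur2 over the serialized key, so this sketch will
     * not place records where the default partitioner would.
     */
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = partitioningPart(key).getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(bytes);
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // All three keys share the prefix "42", so they print the same partition number.
        for (String key : new String[] {"42|product", "42|media", "42|review"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 12));
        }
    }
}
```

Because the full string stays the message key, compaction still treats
"42|product" and "42|media" as different keys, while the partitioner keeps
them in the same partition.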
