Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2012/10/05 22:02:02 UTC

[jira] [Commented] (KAFKA-544) Retain key in producer

    [ https://issues.apache.org/jira/browse/KAFKA-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470610#comment-13470610 ] 

Jay Kreps commented on KAFKA-544:
---------------------------------

After looking at the code I think there is a fair amount of work here. I recommend we put off the user-facing API change until 0.9. Instead I propose the following intermediate hack for 0.8:
1. Use the existing ProducerData object to get the key and value. This is slightly unnatural because it allows you to associate a key with many values.
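2. Use option (2) from the issue description for the encoders, i.e. change the encoder/decoder types so they no longer refer to Message and can be used for both the key and the value.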

So specifically this means that the two interfaces would now be:
trait Encoder[T] {
  def toBytes(t: T): Array[Byte]
}
trait Decoder[T] {
  def fromBytes(b: Array[Byte]): T
}
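
To make the shape of these interfaces concrete, here is a minimal sketch of two implementations, assuming toBytes returns Array[Byte] and fromBytes returns T. The class names Utf8StringCodec and LongCodec are hypothetical and are not part of the proposal; they only illustrate that the same interface could serve either a key or a value.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// The two interfaces from the comment, with assumed return types.
trait Encoder[T] {
  def toBytes(t: T): Array[Byte]
}

trait Decoder[T] {
  def fromBytes(b: Array[Byte]): T
}

// Hypothetical codec: strings as UTF-8 bytes. Nothing here refers to
// Message, so it could be used for a key or for a value.
class Utf8StringCodec extends Encoder[String] with Decoder[String] {
  def toBytes(t: String): Array[Byte] = t.getBytes(StandardCharsets.UTF_8)
  def fromBytes(b: Array[Byte]): String = new String(b, StandardCharsets.UTF_8)
}

// Hypothetical codec: a Long as 8 big-endian bytes, e.g. for a numeric key.
class LongCodec extends Encoder[Long] with Decoder[Long] {
  def toBytes(t: Long): Array[Byte] = ByteBuffer.allocate(8).putLong(t).array()
  def fromBytes(b: Array[Byte]): Long = ByteBuffer.wrap(b).getLong()
}
```

With something like this, a producer could pair a LongCodec for keys with a Utf8StringCodec for values without either codec knowing about the other.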

There would now be two encoders, one for the key and one for the value. The value encoder would still be configured by the property "serializer.class", but we would add a new property, "key.serializer.class", which would default to the same class as the value serializer.
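
The defaulting rule could be sketched roughly as follows. Only the two property names come from this comment; the SerializerConfig object, its method names, and the fallback class name "example.PassThroughEncoder" are hypothetical placeholders, not the actual producer configuration code.

```scala
import java.util.Properties

// Hypothetical helper showing how the key serializer could fall back
// to the value serializer when "key.serializer.class" is not set.
object SerializerConfig {
  // Property names as given in the comment.
  val ValueSerializerProp = "serializer.class"
  val KeySerializerProp = "key.serializer.class"

  // Assumption: a placeholder default, used only if neither property is set.
  val FallbackSerializer = "example.PassThroughEncoder"

  def valueSerializerClass(props: Properties): String =
    props.getProperty(ValueSerializerProp, FallbackSerializer)

  // Defaults to whatever the value serializer resolves to.
  def keySerializerClass(props: Properties): String =
    props.getProperty(KeySerializerProp, valueSerializerClass(props))
}
```

Under this sketch, a configuration that sets only "serializer.class" keeps working unchanged, and the key is encoded with the same class as the value until "key.serializer.class" is set explicitly.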

The plan would be to hold off on any changes to the consumer for now.
                
> Retain key in producer
> ----------------------
>
>                 Key: KAFKA-544
>                 URL: https://issues.apache.org/jira/browse/KAFKA-544
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Jay Kreps
>            Assignee: Jay Kreps
>
> KAFKA-506 added support for retaining a key in the messages, however this field is not yet set by the producer.
> The proposal for doing this is to change the producer api to change ProducerData to allow only a single key/value pair so it has a one-to-one mapping to Message. That is change from
>   ProducerData(topic: String, key: K, data: Seq[V])
> to
>   ProducerData(topic: String, key: K, data: V)
> The key itself needs to be encoded. There are several ways this could be handled. A few of the options:
> 1. Change the Encoder and Decoder to be MessageEncoder and MessageDecoder and have them take both a key and value.
> 2. Another option is to change the type of the encoder/decoder to not refer to Message so it could be used for both the key and value.
> I favor the second option but am open to feedback.
> One concern with our current approach to serialization as well as both of these proposals is that they are inefficient. We go from Object=>byte[]=>Message=>MessageSet with a copy at each step. In the case of compression there are a bunch of intermediate steps. We could theoretically clean this up by instead having an interface for the encoder that was something like
>    Encoder.writeTo(buffer: ByteBuffer, object: AnyRef)
> and
>    Decoder.readFrom(buffer:ByteBuffer): AnyRef
> However, there are two problems with this. The first is that we don't actually know the size of the data until it is serialized, so we can't really allocate the ByteBuffer properly and might need to resize it. The second is that in the case of compression there is a whole other path to consider. Originally I thought maybe it would be good to try to fix this, but now I think it should be out-of-scope and we should revisit the efficiency issue in a future release in conjunction with our internal handling of compression.
