Posted to jira@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/02/25 07:22:00 UTC
[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

    [ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375972#comment-16375972 ] 

ASF GitHub Bot commented on KAFKA-3744:
---------------------------------------

hachikuji closed pull request #1419: KAFKA-3744: Allocate 2 attribute bits to signal payload format
URL: https://github.com/apache/kafka/pull/1419
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/docs/implementation.html b/docs/implementation.html
index 16ba07a456c..aa58d8c740a 100644
--- a/docs/implementation.html
+++ b/docs/implementation.html
@@ -146,12 +146,24 @@ <h3><a id="messages" href="#messages">5.3 Messages</a></h3>
 <p>
 Messages consist of a fixed-size header, a variable length opaque key byte array and a variable length opaque value byte array. The header contains the following fields:
 <ul>
-    <li> A CRC32 checksum to detect corruption or truncation. <li/>
+    <li> A CRC32 checksum to detect corruption or truncation. </li>
     <li> A format version. </li>
     <li> An attributes identifier </li>
     <li> A timestamp </li>
 </ul>
-Leaving the key and value opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage. The <code>MessageSet</code> interface is simply an iterator over messages with specialized methods for bulk reading and writing to an NIO <code>Channel</code>.
+Leaving the key and payload mostly opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. But to facilitate interoperability two attribute bits are defined as a serialization selector:
+<ul>
+  <li>0 and 1 specify two payload encodings (text and avro-binary); key format is unspecified.</li>
+  <li>2 specifies that the key must be a JSON object with a property "t" containing a
+<a href="http://www.iana.org/assignments/media-types/media-types.xhtml">media-type</a> string
+registered with IANA.  For example, key <pre>  {"t":"application/cbor"}</pre> specifies that the
+payload is serialized using Concise Binary Object Representation, RFC 7049. The JSON object in key
+may contain an arbitrary set of additional properties.  Using media-type allows payloads of any
+registered format (e.g., image/jpeg, application/pdf) to be specified.</li>
+  <li>3 is reserved; key and payload formats are unspecified.</li>
+</ul>
+
+The <code>MessageSet</code> interface is simply an iterator over messages with specialized methods for bulk reading and writing to an NIO <code>Channel</code>.
 
 <h3><a id="messageformat" href="#messageformat">5.4 Message Format</a></h3>
 
@@ -165,10 +177,16 @@ <h3><a id="messageformat" href="#messageformat">5.4 Message Format</a></h3>
      *      1 : gzip
      *      2 : snappy
      *      3 : lz4
+     *      4~7 : reserved
      *    bit 3 : Timestamp type
      *      0 : create time
      *      1 : log append time
-     *    bit 4 ~ 7 : reserved
+     *    bit 4 ~ 5 : Serialization
+     *      0 : key: opaque, payload: text/plain
+     *      1 : key: opaque, payload: avro-binary
+     *      2 : key: json object, payload: media-type specified by property "t"
+     *      3 : reserved (key: opaque, payload: opaque)
+     *    bit 6 ~ 7 : reserved
      * 4. (Optional) 8 byte timestamp only if "magic" identifier is greater than 0
      * 5. 4 byte key length, containing length K
      * 6. K byte key
@@ -195,8 +213,8 @@ <h3><a id="log" href="#log">5.5 Log</a></h3>
 timestamp      : 8 bytes (Only exists when magic value is greater than zero)
 key length     : 4 bytes
 key            : K bytes
-value length   : 4 bytes
-value          : V bytes
+payload length : 4 bytes
+payload        : V bytes
 </pre>
 <p>
 The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore the complexity of maintaining the mapping from a random id to an offset requires a heavy weight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However once we settled on a counter, the jump to directly using the offset seemed natural&mdash;both after all are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach.
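
The attribute layout proposed in the diff above can be read as a set of bit fields. The following is a minimal sketch, not Kafka code: the function name and the returned dictionary are hypothetical, the bit positions are taken from the "5.4 Message Format" hunk, and compression value 0 is assumed to mean "no compression" because that line falls outside the hunk.

```python
# Sketch only: decodes the one-byte "attributes" field using the layout
# proposed in the diff (bits 0-2 compression, bit 3 timestamp type,
# bits 4-5 serialization selector, bits 6-7 reserved).

COMPRESSION = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4"}  # 0 is assumed; 4-7 reserved
TIMESTAMP_TYPE = {0: "create time", 1: "log append time"}
SERIALIZATION = {
    0: "key: opaque, payload: text/plain",
    1: "key: opaque, payload: avro-binary",
    2: "key: json object, payload: media-type specified by property 't'",
    3: "reserved (key: opaque, payload: opaque)",
}


def decode_attributes(attributes: int) -> dict:
    """Split an attributes byte into its proposed fields (hypothetical helper)."""
    if not 0 <= attributes <= 0xFF:
        raise ValueError("attributes must fit in one byte")
    compression = attributes & 0x07           # bits 0-2
    timestamp_type = (attributes >> 3) & 0x01  # bit 3
    serialization = (attributes >> 4) & 0x03   # bits 4-5
    reserved = (attributes >> 6) & 0x03        # bits 6-7
    return {
        "compression": COMPRESSION.get(compression, "reserved"),
        "timestamp_type": TIMESTAMP_TYPE[timestamp_type],
        "serialization": serialization,
        "serialization_meaning": SERIALIZATION[serialization],
        "reserved_bits": reserved,
    }


if __name__ == "__main__":
    # 0b00101010: serialization selector 2, log append time, snappy
    print(decode_attributes(0b00101010))
```

Masking and shifting keeps each field independent, so a reader that does not understand the serialization selector can still recover the compression codec and timestamp type from the same byte.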


 



> Message format needs to identify serializer
> -------------------------------------------
>
>                 Key: KAFKA-3744
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3744
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: David Kay
>            Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new users. Section 1.3 Step 4 says: "Send some messages" and takes lines of text from the command line. A beginner's guide (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign, slide 104) says:
> {noformat}
>    Kafka does not care about data format of msg payload
>    Up to developer to handle serialization/deserialization
>       Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a third producer sends JSON, and a fourth sends CBOR, how does the consumer identify which deserializer to use for the payload?  The commit includes an opaque K byte Key that could potentially include a codec identifier, but provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a single mime-type for all web content.  There must be a way to signal the serialization used to produce this message's V byte payload, and documenting the existence of even a rudimentary codec registry with a few values (text, Avro, JSON, CBOR) would establish the pattern to be used for future serialization libraries.
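
To illustrate the question raised in the issue, here is a rough sketch of how a consumer might pick a payload format from the two-bit selector proposed in the pull request. It is an illustration under assumptions, not an existing Kafka API: the function name is hypothetical, and the label "avro-binary" is copied from the proposal rather than from the IANA registry.

```python
import json
from typing import Optional


def payload_media_type(selector: int, key: bytes) -> Optional[str]:
    """Return a format label for the payload, or None if unspecified.

    Hypothetical helper following the selector values in the proposal:
      0 -> text/plain, 1 -> avro-binary,
      2 -> key is a JSON object whose "t" property names the media type,
      3 -> reserved, formats unspecified.
    """
    if selector == 0:
        return "text/plain"
    if selector == 1:
        return "avro-binary"
    if selector == 2:
        try:
            parsed = json.loads(key)
        except ValueError as exc:  # covers invalid JSON and invalid UTF-8
            raise ValueError("selector 2 requires a JSON key") from exc
        if not isinstance(parsed, dict) or not isinstance(parsed.get("t"), str):
            raise ValueError('selector 2 requires a JSON object with a string "t"')
        return parsed["t"]
    if selector == 3:
        return None
    raise ValueError("selector must be in the range 0..3")


if __name__ == "__main__":
    print(payload_media_type(2, b'{"t":"application/cbor"}'))
```

A consumer could then map the returned label to a deserializer of its choice; additional properties in the JSON key are ignored here, which matches the proposal's statement that the key may carry arbitrary extra properties.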


