You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Navina Ramesh (JIRA)" <ji...@apache.org> on 2015/02/04 00:12:36 UTC

[jira] [Updated] (SAMZA-484) Define the serialization/deserialization format for stream tuple

     [ https://issues.apache.org/jira/browse/SAMZA-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Navina Ramesh updated SAMZA-484:
--------------------------------
    Attachment: SAMZA-484.patch

Attached a draft that uses pluggable serde with the Sql layer.
[The patch also includes Yi's changes in SAMZA-482]

SerdeManager is defined in Samza in such a way that :
* serdes mapping from the config is not exposed to the task instances (not accessible by the process() method)
* it serializes/deserializes the key and message individually. This allows for possibly having different serde configuration of key and message.

The idea of a tuple is to wrap key and message within the same entity and provide a uniform access. Hence, it is not possible to implement a "TupleSerde" without making significant changes to the existing code. 

Another alternative is to include the key and message Serde objects in the IncomingMessageEnvelope and use them when wrapping inside IncomingMessageTuple.
Example usage:
public class IncomingMessageEnvelope {
	private final Object key;
	private final Object message;
	private final Serde[Object] keySerde;
	private final Serde[Object] msgSerde;

	...
}

def process(envelope: IncomingMessageEnvelope, ...) {
	val tuple = new IncomingMessageTuple(envelope)
	val message = tuple.getMessage() 	//Returns object implementing the correct Data interface , eg StringData or AvroData
	message.intValue() 
	...
}

Since the above approach also involves modifying the structure of IncomingMessageEnvelope, the other alternative is to define a separate Serde class for each data serde format. For example, we will have a SqlStringSerde that will deserialize bytes to return "StringData" object (which implements the generic Data interface) and serialize StringData object to bytes.
Con: 
1. This causes some amount of code redundancy (*dislike*) because we now have a new Serde[Object] for each data-serde format supported in SQL along with the existing ones.
2. The developer of a Samza job should be aware of the type of "data" object returned since most of the getters would throw UnsupportedOperationException if the operation is not supported.

Pro:
1. Keeps the changes to the core to a minimum and usage still looks like above. 2. Still supports pluggable serialization


> Define the serialization/deserialization format for stream tuple
> ----------------------------------------------------------------
>
>                 Key: SAMZA-484
>                 URL: https://issues.apache.org/jira/browse/SAMZA-484
>             Project: Samza
>          Issue Type: Sub-task
>          Components: sql
>            Reporter: Yi Pan (Data Infrastructure)
>            Priority: Minor
>              Labels: project
>         Attachments: SAMZA-484.patch
>
>
> It came out in the discussion for streaming SQL that we will need to define the serialization/deserialization format for stream tuple.
> The ideal serialization/deserialization format should allow both forward and backward compatibility on additional/missing fields in the data.
> Several choices to be considered:
> 1) Avro
> 2) Protobuf
> 3) Flatbuffer
> It might also be interesting to consider a pluggable serialization interface that allows different serialization methods for different Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)