Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2015/02/04 00:40:34 UTC
[jira] [Commented] (SAMZA-484) Define the serialization/deserialization format for stream tuple

    [ https://issues.apache.org/jira/browse/SAMZA-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304319#comment-14304319 ] 

Chris Riccomini commented on SAMZA-484:
---------------------------------------

bq. This causes some amount of code redundancy (dislike) because we now have a new Serde[Object] for each data-serde format supported in SQL along with the existing ones.

Ya, this is a bummer.

What if we wrap, rather than duplicate, the existing Serdes? So, the Sql\*Serde classes would be responsible for mapping to/from the underlying Serde objects?

{code}
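import org.apache.samza.serializers.Serde;
import org.apache.samza.serializers.StringSerde;

// StringData is the proposed SQL data wrapper for String values.
public class SqlStringSerde implements Serde<StringData> {
    // Existing serde that does the actual byte conversion.
    private final Serde<String> serde;

    public SqlStringSerde(String encoding) {
        this.serde = new StringSerde(encoding);
    }

    @Override
    public StringData fromBytes(byte[] bytes) {
        // Deserialize with the wrapped serde, then wrap the result.
        return new StringData(serde.fromBytes(bytes));
    }

    @Override
    public byte[] toBytes(StringData object) {
        // Unwrap the value and let the wrapped serde serialize it.
        return serde.toBytes(object.strValue());
    }
}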
{code}

It's still not quite ideal, but at least we're not duplicating code. I think we'd have to implement Sql serdes and Data objects for String, byte, Integer, Long, Json, and Avro. It's kind of annoying, but I think it's the most semantically accurate. If you use a StringSerde, you'll get String objects back. If you use an AvroDataSerde, you'll get Data objects back that wrap Avro objects.
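
If writing one wrapper class per type gets too repetitive, the wrapping could probably be factored into a single generic class. The following is only a rough sketch under stated assumptions, not something from the patch: WrappingSerde and Utf8StringSerde are hypothetical names, the local Serde interface is a stand-in for org.apache.samza.serializers.Serde so the snippet compiles on its own, and StringData is a minimal stand-in for the proposed data object.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Stand-in for org.apache.samza.serializers.Serde (assumption: same two methods).
interface Serde<T> {
    T fromBytes(byte[] bytes);
    byte[] toBytes(T object);
}

// Stand-in for an existing plain serde; plays the role of StringSerde here.
class Utf8StringSerde implements Serde<String> {
    @Override
    public String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public byte[] toBytes(String object) {
        return object.getBytes(StandardCharsets.UTF_8);
    }
}

// Minimal stand-in for the proposed SQL data object wrapping a String.
final class StringData {
    private final String value;

    StringData(String value) {
        this.value = value;
    }

    String strValue() {
        return value;
    }
}

// Hypothetical generic wrapper: delegates byte handling to an existing
// Serde<T> and maps between T and the SQL data type D.
final class WrappingSerde<T, D> implements Serde<D> {
    private final Serde<T> delegate;
    private final Function<T, D> wrap;
    private final Function<D, T> unwrap;

    WrappingSerde(Serde<T> delegate, Function<T, D> wrap, Function<D, T> unwrap) {
        this.delegate = delegate;
        this.wrap = wrap;
        this.unwrap = unwrap;
    }

    @Override
    public D fromBytes(byte[] bytes) {
        // Deserialize with the existing serde, then wrap the result.
        return wrap.apply(delegate.fromBytes(bytes));
    }

    @Override
    public byte[] toBytes(D object) {
        // Unwrap the data object, then serialize with the existing serde.
        return delegate.toBytes(unwrap.apply(object));
    }
}
```

With this, a String variant would be built as new WrappingSerde<String, StringData>(new Utf8StringSerde(), StringData::new, StringData::strValue), and the other types would only need their own Data class plus a pair of mapping functions. Whether this fits how Samza instantiates serdes from config is a separate question; a SerdeFactory per type may still be needed.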

The only alternatives that I can think of are what you list, and what was discussed on SAMZA-429. Personally, I don't want to force a data model on Samza, nor do I really want to have confusing duplicate APIs in IncomingMessageEnvelope, so that seems to leave only the Serde approach. At least this way, only SQL (code-generated configs), or developers using SQL operator tasks, will really get hit by this.

> Define the serialization/deserialization format for stream tuple
> ----------------------------------------------------------------
>
>                 Key: SAMZA-484
>                 URL: https://issues.apache.org/jira/browse/SAMZA-484
>             Project: Samza
>          Issue Type: Sub-task
>          Components: sql
>            Reporter: Yi Pan (Data Infrastructure)
>            Assignee: Navina Ramesh
>            Priority: Minor
>              Labels: project
>         Attachments: SAMZA-484.patch
>
>
> It came out in the discussion for streaming SQL that we will need to define the serialization/deserialization format for stream tuple.
> The ideal serialization/deserialization format should allow both forward and backward compatibility on additional/missing fields in the data.
> Several choices to be considered:
> 1) Avro
> 2) Protobuf
> 3) Flatbuffer
> It might also be interesting to consider a pluggable serialization interface that allows different serialization methods for different Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)