You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Jakob Homan (JIRA)" <ji...@apache.org> on 2014/03/20 23:39:44 UTC

[jira] [Commented] (SAMZA-198) Provide SystemStreamPartition info to SerDe fromBytes/toBytes methods

    [ https://issues.apache.org/jira/browse/SAMZA-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942435#comment-13942435 ] 

Jakob Homan commented on SAMZA-198:
-----------------------------------

To note, there are two other ways to accomplish this, but both have drawbacks
* Write a specific SerDe against each event type you're processing.  The SerDe knows it will only be receiving messages of a certain type and can provide the extra information itself.  Run two Samza jobs to process the input streams, each with its own set of input streams.  However, this doubles the number of jobs you're running, which is unnecessary.
* Have a separate systemm in the config to handle each type of event and assign each respective Avro event to it.  For each of those systems, then provide a specific serde as above.  However, this doubles the number of systems running about and for Kafka doubles the number of BrokerProxies and splits the metrics into two different groups.

> Provide SystemStreamPartition info to SerDe fromBytes/toBytes methods
> ---------------------------------------------------------------------
>
>                 Key: SAMZA-198
>                 URL: https://issues.apache.org/jira/browse/SAMZA-198
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jakob Homan
>
> Right now the Deserializer fromBytes method takes just a byte array, meaning that it doesn't know anything about where those bytes came from.
> We have a use case with Avro messages coming from Kafka where we may be getting several different versions of the same schema (each different version coming from a different stream-partition).  This works okay.  However, in the same stream task, we're actually consuming from more than one type of Avro message and each of those types has that same situation.
> Once we're in the process method we can take the generic record and poke it for its internal structure to see what type and version it is.  At this point we can re-encode it if necessary to bring its schema version up to the latest before sending it on.  However, this extra work is expensive and is dominating the time spent in the process method.
> However, if at deserialization time we knew what SSP the message came from, we could provide the Avro GenericDatumReader the reader schema, thus saving the expensive re-encode step in the process method.
> I imagine other systems could benefit from this extra info as well.  The information is available in the IncomingMessageEnvelope when we call the deserializer, it's just not being passed in.
> (A parallel argument applies to the toBytes method in the Serializer interface)



--
This message was sent by Atlassian JIRA
(v6.2#6252)