Posted to commits@samza.apache.org by "Martin Kleppmann (JIRA)" <ji...@apache.org> on 2014/04/28 18:44:14 UTC

[jira] [Resolved] (SAMZA-232) Keys and values in state should be versioned

     [ https://issues.apache.org/jira/browse/SAMZA-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Kleppmann resolved SAMZA-232.
------------------------------------

    Resolution: Not a Problem
      Assignee: Martin Kleppmann

Having thought about it further, I agree, so I'm closing this as "not a problem".

Background thinking: I initially thought that metadata on changelog messages would be necessary in order to achieve exactly-once semantics, but that is not the case. If we want to associate a particular position in the changelog with a particular position in the input streams, we can do so by checkpointing the input stream offset and the changelog offset together. When a job restarts and restores its state from the changelog, it can consume only up to the checkpointed changelog offset, which gives it a snapshot of the state that is consistent with the checkpointed input offset.
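
As a rough illustration of the idea above, here is a minimal Python sketch. It is not Samza's API: the names Checkpoint, restore_state and process are hypothetical, the changelog is modelled as an in-memory list, and the uncheckpointed tail of the changelog is simply discarded on restart, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# One changelog entry: (key, new value); None marks a deletion.
ChangelogEntry = Tuple[str, Optional[int]]


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint that records both offsets together."""
    input_offset: int      # index of the next input message to process
    changelog_offset: int  # number of changelog entries the checkpoint covers


def restore_state(changelog: List[ChangelogEntry], checkpoint: Checkpoint) -> Dict[str, int]:
    """Replay the changelog only up to the checkpointed changelog offset."""
    state: Dict[str, int] = {}
    for key, value in changelog[:checkpoint.changelog_offset]:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def process(word: str, state: Dict[str, int], changelog: List[ChangelogEntry]) -> None:
    """Toy task: count words and log every state change."""
    state[word] = state.get(word, 0) + 1
    changelog.append((word, state[word]))


inputs = ["a", "b", "a", "c"]
changelog: List[ChangelogEntry] = []
state: Dict[str, int] = {}

# Process two messages, then checkpoint both offsets together.
for word in inputs[:2]:
    process(word, state, changelog)
checkpoint = Checkpoint(input_offset=2, changelog_offset=len(changelog))

# Process one more message, then "crash" before the next checkpoint.
process(inputs[2], state, changelog)

# Restart: restore a snapshot consistent with the checkpointed input offset,
# then resume consuming input from that offset.
restored = restore_state(changelog, checkpoint)
resumed_log = changelog[:checkpoint.changelog_offset]  # assumption: drop the uncheckpointed tail
for word in inputs[checkpoint.input_offset:]:
    process(word, restored, resumed_log)

print(restored)
```

In this sketch, replaying the whole changelog and then resuming from the checkpointed input offset would count the third message twice. Stopping at the checkpointed changelog offset avoids that without any per-message metadata.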

> Keys and values in state should be versioned
> --------------------------------------------
>
>                 Key: SAMZA-232
>                 URL: https://issues.apache.org/jira/browse/SAMZA-232
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Martin Kleppmann
>            Assignee: Martin Kleppmann
>
> At the moment, keys and values that are written to a task's key-value store (and the associated changelog stream) are just the bytes that were generated by the serde. This will be a problem in future, since it gives us no way of changing the storage format.
> For example, in order to implement exactly-once semantics, we may want to associate additional metadata with each value (and that metadata would be managed by the framework, and would not be seen by serdes). The current implementation does not give us any room to make such a change, because a job would not know whether the value it is reading includes metadata or not.
> I propose that we prefix every key and every value in the key-value store and the changelog stream with a version number, currently just a zero byte. That is an incompatible change, so we should do it before the 0.7.0 release. In future, if we ever need to change the storage format, we can bump the version number and thus allow jobs to be gracefully upgraded in-place.
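
For reference, the version prefix proposed in the quoted description could look roughly like the following Python sketch. The function names wrap and unwrap are hypothetical and not part of Samza; the only detail taken from the issue is that the initial version is a single zero byte placed in front of the serde output.

```python
FORMAT_VERSION = 0  # the issue proposes a single zero byte as the initial version


def wrap(serialized: bytes, version: int = FORMAT_VERSION) -> bytes:
    """Prefix the serde output with a one-byte storage format version."""
    return bytes([version]) + serialized


def unwrap(stored: bytes) -> bytes:
    """Strip the version byte and return the bytes to hand to the serde."""
    if not stored:
        raise ValueError("stored bytes are empty; expected a version prefix")
    version, payload = stored[0], stored[1:]
    if version == 0:
        return payload
    # A later format (for example one carrying framework metadata) would be
    # handled in its own branch here.
    raise ValueError(f"unsupported storage format version: {version}")


stored_value = wrap(b"hello")
print(stored_value, unwrap(stored_value))
```

A reader that dispatches on the first byte can tell old and new formats apart, which is what would allow an in-place upgrade if the format ever changed.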



--
This message was sent by Atlassian JIRA
(v6.2#6252)