Posted to dev@kafka.apache.org by Patrik Kleindl <pk...@gmail.com> on 2019/04/26 07:22:19 UTC

KAFKA-8037 - KTable/GlobalKTable may restore bad data from source topic

Hello

During an incident analysis it was discovered that the restoration process
for state stores that are built directly from a source topic (i.e. not a
changelog topic) is not protected against undeserializable records in the
same way that normal processing is by the LogAndContinueExceptionHandler.
Because the restore writes the byte arrays directly to the store, such
records can be inserted and cause problems when the store is accessed later.
To prevent this, users would have to catch such problems in their own code,
which can easily be forgotten.
Stores based on a changelog topic don't have this problem, because such
records are already handled during processing.

I created a small PR for GlobalKTables which uses the Deserializer during
the restoration process and can therefore handle such records.
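
For illustration, the idea of checking records during restoration could be
sketched roughly as follows. This is a standalone sketch and not the code
from the PR: the class RestoreFilterSketch, the RawRecord type, the
Function-based deserializers and the Map-based store are all hypothetical
stand-ins, and skipping bad records is an assumption modelled on what
LogAndContinueExceptionHandler does during normal processing.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Kafka Streams or from the PR.
public class RestoreFilterSketch {

    // Stand-in for a raw key/value pair read from the source topic.
    public static final class RawRecord {
        final byte[] key;
        final byte[] value;

        public RawRecord(byte[] key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    public static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Writes the raw bytes to the store only if key and value deserialize.
    // Returns the number of records that were skipped.
    public static int restore(List<RawRecord> records,
                              Function<byte[], ?> keyDeserializer,
                              Function<byte[], ?> valueDeserializer,
                              Map<String, byte[]> store) {
        int skipped = 0;
        for (RawRecord r : records) {
            if (r.key == null) {
                skipped++;
                continue;
            }
            try {
                // The results are discarded; this is only a validity check.
                keyDeserializer.apply(r.key);
                if (r.value != null) {
                    valueDeserializer.apply(r.value);
                }
            } catch (RuntimeException e) {
                // Assumed "log and continue" behaviour: skip the bad record.
                skipped++;
                continue;
            }
            // The store is keyed by String here because byte[] has no value equality.
            String storeKey = new String(r.key, StandardCharsets.UTF_8);
            if (r.value == null) {
                store.remove(storeKey); // a null value is treated as a delete
            } else {
                store.put(storeKey, r.value);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<RawRecord> records = new ArrayList<>();
        records.add(new RawRecord(bytes("a"), bytes("1")));
        records.add(new RawRecord(bytes("b"), bytes("not-a-number")));
        records.add(new RawRecord(bytes("c"), bytes("3")));

        Map<String, byte[]> store = new LinkedHashMap<>();
        int skipped = restore(
                records,
                b -> new String(b, StandardCharsets.UTF_8),
                b -> Integer.parseInt(new String(b, StandardCharsets.UTF_8)),
                store);
        System.out.println("restored=" + store.keySet() + " skipped=" + skipped);
    }
}
```

The trade-off in such an approach is that every record is deserialized once
more during restore, purely as a check, while the store still receives the
original bytes.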

During the PR review, John Roesler asked whether it would be an alternative
to use normal processing instead of restoration in these cases:
<quote>
My question is, in the cases where we're "restoring" from the input topic,
rather than from the changelog, why bother with the "restore" code path at
all, why not just process the input topic normally? The RecordQueue would
ensure that we read all the older records from this store's input before
processing newer records from other topics anyway. This effectively means
that we'd "restore" the state of the store in question before trying to use
it for joins, etc., which I think is what really matters in the end.
</quote>

As this might have other implications, I want to bring this back here for
discussion (or should it be discussed on JIRA?).
Two points that came to my mind:

   - This could limit the performance of this kind of "restore", as the
   restore consumer can be tuned differently from the normal consumer.
   - I am not sure whether this is equally applicable to global stores,
   because their restore works differently from that of local stores and
   they have to be completely restored before any other processing
   (including restoration of local stores) can start.
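
To make the first point concrete: Kafka Streams lets the embedded consumers
be configured separately through the "main.consumer.", "restore.consumer."
and "global.consumer." config prefixes. A minimal sketch using plain
Properties follows; the class name and the values are illustrative
assumptions only, not recommendations.

```java
import java.util.Properties;

// Hypothetical sketch: class name and values are illustrative only.
public class RestoreConsumerTuningSketch {

    public static Properties streamsProps() {
        Properties props = new Properties();
        // Consumer used for normal processing.
        props.put("main.consumer.max.poll.records", "500");
        // Consumer used to restore local state stores; tuned for throughput here.
        props.put("restore.consumer.max.poll.records", "5000");
        // Consumer used for global stores.
        props.put("global.consumer.max.poll.records", "5000");
        return props;
    }

    public static void main(String[] args) {
        streamsProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If the source topic were read through normal processing instead, only the
"main.consumer." settings would apply to it.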

Any feedback is welcome

best regards

Patrik