Posted to commits@hudi.apache.org by "Daniel Ford (Jira)" <ji...@apache.org> on 2023/03/22 10:34:00 UTC

[jira] [Created] (HUDI-5973) Add cachedSchema per write batch to fix idempotency with getSourceSchema calls

Daniel Ford created HUDI-5973:
---------------------------------

             Summary: Add cachedSchema per write batch to fix idempotency with getSourceSchema calls
                 Key: HUDI-5973
                 URL: https://issues.apache.org/jira/browse/HUDI-5973
             Project: Apache Hudi
          Issue Type: Task
          Components: deltastreamer
            Reporter: Daniel Ford


The issue: getSourceSchema is not idempotent when the SchemaRegistry provider is used. Even within a single write batch, calling getSourceSchema multiple times can return different results, because each call may pick up the latest schema from the schema registry. Ideally it should return one schema per write batch.
The fix is to add a new API to the Source abstract class called "clearCaches" or "cleanupResources", and to add a similar API to SchemaProvider. Within source.clearCaches, we will call schemaProvider.clearCaches.
In the case of SchemaRegistryProvider, for every batch we will fetch the schema from the remote schema registry and cache it locally. Subsequent calls to getSourceSchema will return the same cached value. Before moving on to the next batch, we will have to call clearCaches, which invalidates the local cache of the source schema.
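
A minimal sketch of the proposed per-batch caching follows. It is not Hudi code: the class names CachingSchemaProvider, SketchSource and BatchSchemaCacheDemo are hypothetical, the schema is represented as a plain String, and the remote registry call is stood in for by a Supplier. Only the method names getSourceSchema and clearCaches come from the description above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a schema provider backed by a remote schema registry.
class CachingSchemaProvider {
    // Assumption: this supplier performs the remote registry lookup.
    private final Supplier<String> remoteFetch;
    // null means "nothing fetched for the current batch yet".
    private String cachedSchema;

    CachingSchemaProvider(Supplier<String> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    // Fetches from the registry at most once per batch, then serves the cached value.
    synchronized String getSourceSchema() {
        if (cachedSchema == null) {
            cachedSchema = remoteFetch.get();
        }
        return cachedSchema;
    }

    // Invalidates the local cache so the next batch fetches a fresh schema.
    synchronized void clearCaches() {
        cachedSchema = null;
    }
}

// Hypothetical stand-in for a source that delegates schema handling to its provider.
class SketchSource {
    private final CachingSchemaProvider schemaProvider;

    SketchSource(CachingSchemaProvider schemaProvider) {
        this.schemaProvider = schemaProvider;
    }

    String getSourceSchema() {
        return schemaProvider.getSourceSchema();
    }

    // Called between batches; clears the provider's cache as described above.
    void clearCaches() {
        schemaProvider.clearCaches();
    }
}

class BatchSchemaCacheDemo {
    public static void main(String[] args) {
        AtomicInteger version = new AtomicInteger();
        // Every remote fetch returns a newer version, mimicking a registry that keeps changing.
        SketchSource source = new SketchSource(
            new CachingSchemaProvider(() -> "schema-v" + version.incrementAndGet()));

        System.out.println(source.getSourceSchema()); // batch 1, first call: schema-v1
        System.out.println(source.getSourceSchema()); // batch 1, second call: schema-v1 (cached)
        source.clearCaches();                         // end of batch 1
        System.out.println(source.getSourceSchema()); // batch 2: schema-v2
    }
}
```

In this sketch the cache lifetime is tied to explicit clearCaches calls rather than to a timeout, so the caller decides where a batch boundary lies.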



--
This message was sent by Atlassian Jira
(v8.20.10#820010)