Posted to commits@hudi.apache.org by "Daniel Ford (Jira)" <ji...@apache.org> on 2023/03/22 10:34:00 UTC
[jira] [Created] (HUDI-5973) Add cachedSchema per write batch to fix idempotency with getSourceSchema calls
Daniel Ford created HUDI-5973:
---------------------------------
Summary: Add cachedSchema per write batch to fix idempotency with getSourceSchema calls
Key: HUDI-5973
URL: https://issues.apache.org/jira/browse/HUDI-5973
Project: Apache Hudi
Issue Type: Task
Components: deltastreamer
Reporter: Daniel Ford
The issue: getSourceSchema is not idempotent when the SchemaRegistry provider is used. Even within a single write batch, each call to getSourceSchema could return whatever schema is latest in the schema registry at that moment, so repeated calls may return different schemas. Ideally, it should return one schema for one write batch.
The fix is to add a new API to the Source abstract class called "clearCaches" or "cleanupResources", and to add a similar API to SchemaProvider. source.clearCaches will then call schemaProvider.clearCaches.
In the case of SchemaRegistryProvider, for every batch we will fetch the schema from the remote schema registry and cache it locally. Subsequent calls to getSourceSchema will return the same cached value. Before moving on to the next batch, we will have to call clearCaches, which invalidates the local cache of the source schema.
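A minimal sketch of the proposed caching behaviour, not the actual Hudi implementation. The class names CachingSchemaProvider and BatchSource are hypothetical stand-ins for SchemaRegistryProvider and Source, the schema is represented as a plain String, and the remote registry call is modelled as a Supplier passed in by the caller. Only the method names getSourceSchema and clearCaches come from the description above.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a schema provider backed by a remote schema registry. */
class CachingSchemaProvider {
    // Assumption: the remote registry lookup is modelled as a Supplier.
    private final Supplier<String> registryFetch;
    // Schema cached for the current write batch; null means "not fetched yet".
    private String cachedSchema;

    CachingSchemaProvider(Supplier<String> registryFetch) {
        this.registryFetch = registryFetch;
    }

    /** Fetches from the registry once per batch, then returns the cached value. */
    synchronized String getSourceSchema() {
        if (cachedSchema == null) {
            cachedSchema = registryFetch.get();
        }
        return cachedSchema;
    }

    /** Invalidates the cached schema so the next batch fetches again. */
    synchronized void clearCaches() {
        cachedSchema = null;
    }
}

/** Hypothetical stand-in for a source that delegates to its schema provider. */
class BatchSource {
    private final CachingSchemaProvider schemaProvider;

    BatchSource(CachingSchemaProvider schemaProvider) {
        this.schemaProvider = schemaProvider;
    }

    String getSourceSchema() {
        return schemaProvider.getSourceSchema();
    }

    /** Called before moving on to the next batch. */
    void clearCaches() {
        schemaProvider.clearCaches();
    }
}
```

With this shape, every getSourceSchema call inside one batch sees the same schema even if the registry changes in between, and the write loop calls source.clearCaches() once per batch boundary.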
--
This message was sent by Atlassian Jira
(v8.20.10#820010)