Posted to commits@hudi.apache.org by "Yixue (Andrew) Zhu (Jira)" <ji...@apache.org> on 2020/05/06 17:34:00 UTC

[jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

    [ https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101034#comment-17101034 ] 

Yixue (Andrew) Zhu edited comment on HUDI-603 at 5/6/20, 5:33 PM:
------------------------------------------------------------------

I just started working on this and came up with a seemingly reasonable approach:

If a certain DeltaStreamer configuration is enabled (continuous mode plus a new config, "providerSchemaChangeSupported"), then when the provider schema changes, restart the program so it picks up the new schema.
 We can use this option for a couple of reasons:
 # Spark serialization of Avro records and schemas is optimized when schemas are registered before the program is executed, i.e. before executors are spawned by the driver.
 If we refreshed the schema without recreating the SparkConf, which Spark does not support without restarting the program, this serialization optimization would be defeated.
 # Table schema updates are infrequent.

By throwing an exception in DeltaSync::syncOnce(), the following Spark configurations would restart the program:
   --conf spark.yarn.maxAppAttempts
   --conf spark.yarn.am.attemptFailuresValidityInterval
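The restart-on-change idea above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the class name SchemaChangeGuard and its methods are not part of Hudi): the driver snapshots the provider schema at startup, and each sync iteration compares it against the provider's latest schema, throwing an exception so YARN's app-attempt mechanism restarts the program with the new schema registered up front.

```java
import java.util.Objects;

// Hypothetical sketch (not Hudi code): fail fast when the source schema
// drifts from the snapshot taken at driver startup, so a YARN re-attempt
// (spark.yarn.maxAppAttempts) restarts the app with the new schema.
public class SchemaChangeGuard {
    private final String startupSchema; // schema snapshot captured at driver startup

    public SchemaChangeGuard(String startupSchema) {
        this.startupSchema = startupSchema;
    }

    /** Returns true if the provider's current schema differs from the startup snapshot. */
    public boolean schemaChanged(String latestSchema) {
        return !Objects.equals(startupSchema, latestSchema);
    }

    /** Called once per sync loop; throwing lets YARN restart the application. */
    public void checkOrFail(String latestSchema) {
        if (schemaChanged(latestSchema)) {
            throw new IllegalStateException(
                "Provider schema changed; restarting to register the new schema");
        }
    }
}
```

A DeltaSync-style loop would call checkOrFail with the provider's freshly fetched schema before each sync round; the uncaught exception ends the attempt, and spark.yarn.am.attemptFailuresValidityInterval keeps occasional restarts from exhausting the attempt budget.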



> HoodieDeltaStreamer should periodically fetch table schema update
> -----------------------------------------------------------------
>
>                 Key: HUDI-603
>                 URL: https://issues.apache.org/jira/browse/HUDI-603
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Yixue Zhu
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>              Labels: evolution, pull-request-available, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and hands it to DeltaSync for periodic syncing. However, the default SchemaProvider implementation does not refresh its schema, which can change due to schema evolution. DeltaSync snapshots the schema when it creates the writeClient (from the SchemaProvider instance, or picked up from the source), and the writeClient's schema is not refreshed during the sync loop.
> I think this needs to be addressed to fully support schema evolution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)