Posted to commits@hudi.apache.org by "Léo Biscassi (Jira)" <ji...@apache.org> on 2023/04/12 14:54:00 UTC

[jira] [Commented] (HUDI-5997) Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource

    [ https://issues.apache.org/jira/browse/HUDI-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711431#comment-17711431 ] 

Léo Biscassi commented on HUDI-5997:
------------------------------------

Hi [~codope], I have some questions regarding the best way to get the source schema for the {{HoodieIncrSource}}. I know that we need to add something like the following code in [CloudObjectsSelectorCommon.java|https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java]:
{code:java}
DataFrameReader reader = spark.read().format(fileFormat); // current behavior
if (schemaProvider instanceof FilebasedSchemaProvider) {
    // apply the user-supplied schema (placeholder) instead of letting Spark infer it
    reader = reader.schema(SCHEMA_HERE);
}
{code}
My questions are:
 * For the file format we use {{HoodieIncrSourceConfig.java}}, but the schema provider is configured via the {{--schemaprovider-class}} option of {{HoodieDeltaStreamer.java}}. What would be the recommended way to get this information? Should I use the {{schemaProvider}} passed to the class constructor (shown below, followed by a rough sketch of what I have in mind), or would another approach be preferred?

{code:java}
  public S3EventsHoodieIncrSource(
      TypedProperties props,
      JavaSparkContext sparkContext,
      SparkSession sparkSession,
      SchemaProvider schemaProvider) {
    super(props, sparkContext, sparkSession, schemaProvider);
  }
{code}
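
To make the first question more concrete, here is roughly what I have in mind if the constructor-provided provider is the way to go (just a sketch; the extra {{schemaProvider}} parameter, the method name, and the {{toStructType}} helper are my assumptions, not existing code):
{code:java}
import java.util.List;
import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

// Sketch only: forward the provider that the source was constructed with into the
// helper in CloudObjectsSelectorCommon. The extra schemaProvider parameter and the
// method name are hypothetical.
static Dataset<Row> loadCloudFiles(SparkSession spark, List<String> cloudFilePaths,
                                   String fileFormat, SchemaProvider schemaProvider) {
  DataFrameReader reader = spark.read().format(fileFormat);
  if (schemaProvider instanceof FilebasedSchemaProvider) {
    // Apply the user-supplied source schema instead of relying on Spark's inference.
    StructType sourceSchema = toStructType(schemaProvider.getSourceSchema()); // conversion: see next question
    reader = reader.schema(sourceSchema);
  }
  return reader.load(cloudFilePaths.toArray(new String[0]));
}
{code}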

 * Is there any documentation on how the {{SchemaProvider}} / {{FilebasedSchemaProvider}} abstraction works? I know they provide two utility methods, {{getSourceSchema()}} and {{getTargetSchema()}}, which return an Avro schema. I suppose that is not enough for the {{DataFrameReader}} schema option, since it expects the schema as a DDL string or a {{StructType}}, right? How can I convert it? I am asking because I saw a {{registerAvroSchemas()}} method in {{DeltaSync.java}}, and I am not sure whether I need to do something similar in this case.
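
For reference, this is how I think the conversion could look (a minimal sketch, assuming spark-avro's {{SchemaConverters}} is usable here; {{avroSchema}} would come from {{getSourceSchema()}}):
{code:java}
import org.apache.avro.Schema;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.StructType;

// Sketch only: convert the provider's Avro schema into a Spark StructType
// that DataFrameReader#schema(StructType) accepts.
Schema avroSchema = schemaProvider.getSourceSchema();
StructType sparkSchema = (StructType) SchemaConverters.toSqlType(avroSchema).dataType();
{code}
If there is already a Hudi utility that does this conversion, a pointer to it would be appreciated.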

I apologize for asking many beginner-level questions. I do not regularly use Java, but I understand where to add the code, and I am eager to gain a deeper understanding of Hudi. This exercise has been quite helpful for me.

> Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource
> --------------------------------------------------------------
>
>                 Key: HUDI-5997
>                 URL: https://issues.apache.org/jira/browse/HUDI-5997
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Sagar Sumit
>            Assignee: Léo Biscassi
>            Priority: Major
>             Fix For: 0.14.0
>
>
> See for more details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)