Posted to commits@hudi.apache.org by "leobiscassi (via GitHub)" <gi...@apache.org> on 2023/03/17 01:22:08 UTC

[GitHub] [hudi] leobiscassi opened a new issue, #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

leobiscassi opened a new issue, #8211:
URL: https://github.com/apache/hudi/issues/8211

   **Describe the problem you faced**
   
   I am running a DeltaStreamer job to ingest JSON files from S3 using the `S3EventsHoodieIncrSource`. In this use case, I need to enforce a schema on the source files because some fields may be missing from a given file. According to the docs, I can do this with the `hoodie.deltastreamer.schemaprovider.source.schema.file` parameter, but it does not seem to work.
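   
   For illustration, a minimal sketch of the kind of properties fragment described above. The bucket and file names are hypothetical, and the sketch assumes the job is launched with `--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider`; only the source schema property key is taken from this report.
   
   ```properties
   # Hypothetical location of an Avro schema file (.avsc) that declares every
   # expected field, including the ones that some JSON files omit.
   hoodie.deltastreamer.schemaprovider.source.schema.file=s3://example-bucket/schemas/source.avsc
   ```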
   
   The documentation states that **"For sources that return Dataset<Row>, the schema is obtained implicitly. However, this CLI option allows overriding the schema provider returned by Source"**, but this does not seem to apply to `S3EventsHoodieIncrSource`. Looking at [this piece of code](https://github.com/apache/hudi/blob/178767948e906f673d6d4a357c65c11bc574f619/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java#L133), the supplied schema is never passed to the reader:
   
   ```java
       String fileFormat = props.getString(SOURCE_FILE_FORMAT, DEFAULT_SOURCE_FILE_FORMAT);
       Option<Dataset<Row>> dataset = Option.empty();
       if (!cloudFiles.isEmpty()) {
         dataset = Option.of(sparkSession.read().format(fileFormat).load(cloudFiles.toArray(new String[0])));
       }
       return Pair.of(dataset, instantEndpts.getRight());
   ```
   
   If I supply a source schema through `hoodie.deltastreamer.schemaprovider.source.schema.file`, I expect it to be enforced on every file read by the job. Is it appropriate to consider this a bug? Should I file a bug ticket on Jira?
   
   P.S.: If my assumptions and analysis are right, I would be interested in submitting a fix for this, since it is affecting my workloads 😄
   
   **Environment Description**
   
   This happens in every Hudi version I tested (>= 0.9). I have jobs running with 0.9 and 0.11 on EMR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leobiscassi commented on issue #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

Posted by "leobiscassi (via GitHub)" <gi...@apache.org>.
leobiscassi commented on issue #8211:
URL: https://github.com/apache/hudi/issues/8211#issuecomment-1505422943

   @codope I've commented on the Jira ticket with some questions. I think it's a better place for the discussion, since that makes it easier for other people to find in the future. Thanks in advance.


[GitHub] [hudi] leobiscassi commented on issue #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

Posted by "leobiscassi (via GitHub)" <gi...@apache.org>.
leobiscassi commented on issue #8211:
URL: https://github.com/apache/hudi/issues/8211#issuecomment-1490792676

   @codope yes, I would, but I would need some guidance though. Do you think that's possible?


[GitHub] [hudi] codope commented on issue #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #8211:
URL: https://github.com/apache/hudi/issues/8211#issuecomment-1488037163

   The incremental source infers the schema by simply loading the dataset from the source table. What you're proposing is a good enhancement. Would you like to take it up? HUDI-5997


[GitHub] [hudi] codope commented on issue #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #8211:
URL: https://github.com/apache/hudi/issues/8211#issuecomment-1501944734

   @leobiscassi The [dev setup page](https://hudi.apache.org/contribute/developer-setup) has all the details to help you get started with the project. If you face any issues, I can sync up with you over a call. 
   As for the enhancement, we just need to enforce and set the schema while loading the dataset in `S3EventsHoodieIncrSource` (the code block that you posted).
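   
   A toy sketch of why this matters, in plain Java with no Spark or Hudi dependency. Maps stand in for rows, and the class, method, and field names are all hypothetical. It contrasts inferring the schema from whatever a batch happens to contain with projecting each record onto a declared field list. In Spark itself, a user-supplied schema is passed with `DataFrameReader.schema(...)` before `load(...)`; whether the eventual Hudi change takes exactly that form is not stated in this thread.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.LinkedHashMap;
   import java.util.LinkedHashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   /** Toy model only: plain Java collections stand in for Spark rows and schemas. */
   public class SchemaEnforcementSketch {

     /** Inference: the schema is whatever fields happen to appear in this batch. */
     public static Set<String> inferSchema(List<Map<String, Object>> records) {
       Set<String> fields = new LinkedHashSet<>();
       for (Map<String, Object> record : records) {
         fields.addAll(record.keySet());
       }
       return fields;
     }

     /** Enforcement: every row gets exactly the declared fields; absent ones become null. */
     public static List<Map<String, Object>> enforceSchema(
         List<Map<String, Object>> records, List<String> declaredFields) {
       List<Map<String, Object>> rows = new ArrayList<>();
       for (Map<String, Object> record : records) {
         Map<String, Object> row = new LinkedHashMap<>();
         for (String field : declaredFields) {
           row.put(field, record.get(field)); // null when the field is absent
         }
         rows.add(row);
       }
       return rows;
     }

     public static void main(String[] args) {
       // Hypothetical batch in which no record carries the optional "discount" field.
       Map<String, Object> record = new LinkedHashMap<>();
       record.put("id", 1);
       record.put("amount", 9.5);
       List<Map<String, Object>> batch = Collections.singletonList(record);

       // Hypothetical declared schema, as it might be listed in a schema file.
       List<String> declared = Arrays.asList("id", "amount", "discount");

       System.out.println("inferred: " + inferSchema(batch));
       System.out.println("enforced: " + enforceSchema(batch, declared).get(0).keySet());
     }
   }
   ```
   
   Running `main` prints `inferred: [id, amount]` and `enforced: [id, amount, discount]`. With inference, the column set changes from batch to batch; with an enforced schema, the optional field is always present and is simply null when a file omits it.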


[GitHub] [hudi] leobiscassi commented on issue #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

Posted by "leobiscassi (via GitHub)" <gi...@apache.org>.
leobiscassi commented on issue #8211:
URL: https://github.com/apache/hudi/issues/8211#issuecomment-1502007905

   @codope awesome, I'm going to start working on this today and will let you know if I hit any roadblocks.


[GitHub] [hudi] codope commented on issue #8211: [SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #8211:
URL: https://github.com/apache/hudi/issues/8211#issuecomment-1506985113

   Makes sense. Have updated the ticket.

