Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/28 16:57:34 UTC

[GitHub] [hudi] yihua commented on issue #5952: [SUPPORT] HudiDeltaStreamer S3EventSource SQS optimize for reading large number of files in parallel fashion

yihua commented on issue #5952:
URL: https://github.com/apache/hudi/issues/5952#issuecomment-1168986618

   Thanks for the feature request.
   
   The code you referenced in `S3EventsSource` converts JSON records already held in a `Dataset<String>` into a DataFrame for further processing.  Are you actually referring to optimizing how events are read from SQS (which should not involve reading any files)?
   ```
   Dataset<String> eventRecords = sparkSession.createDataset(selectPathsWithLatestSqsMessage.getLeft(), Encoders.STRING());
   return Pair.of(
       Option.of(sparkSession.read().json(eventRecords)),
       selectPathsWithLatestSqsMessage.getRight());
   ```
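   
   To illustrate the kind of optimization the request seems to be about, here is a minimal sketch (not Hudi's actual code) of fetching SQS message batches concurrently with an `ExecutorService`. The `fetchBatch` method is a hypothetical stand-in for the AWS SDK's SQS `ReceiveMessage` call; a real source would poll the queue there instead of generating stub records.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.stream.Collectors;
   import java.util.stream.IntStream;
   
   public class ParallelSqsPollSketch {
       // Hypothetical stand-in for an SQS ReceiveMessage call; a real
       // implementation would use the AWS SDK's SqsClient here.
       static List<String> fetchBatch(int batchId) {
           return IntStream.range(0, 3)
                   .mapToObj(i -> "{\"batch\":" + batchId + ",\"msg\":" + i + "}")
                   .collect(Collectors.toList());
       }
   
       public static void main(String[] args) throws Exception {
           int parallelism = 4;
           ExecutorService pool = Executors.newFixedThreadPool(parallelism);
           try {
               // Issue the batch fetches concurrently instead of one at a time,
               // so slow round-trips to the queue overlap.
               List<Future<List<String>>> futures = new ArrayList<>();
               for (int b = 0; b < parallelism; b++) {
                   final int batchId = b;
                   futures.add(pool.submit(() -> fetchBatch(batchId)));
               }
               // Collect all fetched event records into one list.
               List<String> allEvents = new ArrayList<>();
               for (Future<List<String>> f : futures) {
                   allEvents.addAll(f.get());
               }
               System.out.println(allEvents.size()); // 4 batches x 3 messages = 12
           } finally {
               pool.shutdown();
           }
       }
   }
   ```
   
   The resulting strings could then be handed to Spark exactly as in the snippet above, via `sparkSession.createDataset(...)` and `read().json(...)`.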
   
   Feel free to create a Jira ticket for the feature request; I'd also encourage you to put up a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org