
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/16 16:40:20 UTC

[GitHub] [hudi] rubenssoto opened a new issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

rubenssoto opened a new issue #1839:
URL: https://github.com/apache/hudi/issues/1839

Hi guys, how are you?

I have some use cases where I want to read from a Hudi dataset using structured streaming and write to another, grouped Hudi dataset. As a real-world example, I have a raw zone in my data lake and want to stream from the raw zone to the curated zone, but sometimes my curated Hudi dataset is grouped.

Spark structured streaming doesn't work with Hudi datasets as sources, so to make this use case work I need to treat the Hudi dataset like a normal Parquet dataset. But Hudi rewrites data on every write, and the new file contains the old data plus the new data. If my sink isn't grouped, that is only a deduplication problem, but my sink is grouped, so it isn't going to work.

I have no guarantee that all the data for a group is in the new file that Hudi writes.

I use PySpark to write my streaming jobs because it's easier for my team, so I think DeltaStreamer is not an option.

Do you have any idea how to solve this? And do you have plans to support Hudi datasets as a Spark streaming source?

Delta Lake has solved this problem with the ignoreChanges option:
https://docs.databricks.com/delta/delta-streaming.html

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] bvaradar closed issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

Posted by GitBox <gi...@apache.org>.

bvaradar closed issue #1839:
URL: https://github.com/apache/hudi/issues/1839


   



[GitHub] [hudi] garyli1019 commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-661272780


   This is an interesting feature. I created a ticket to track this. https://issues.apache.org/jira/browse/HUDI-1114.



[GitHub] [hudi] rubenssoto commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-661235040


   Hi Vinoth, thank you for your answer.
   
   I will watch your video. The incremental query will probably help me for now, but we want to use Spark structured streaming as the default for all our datasets, since Spark streaming takes care of checkpoints and things like that.
   
   If you could add Spark structured streaming integration in a future version, that would be great.
   
   Thank you! :)
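
The checkpointing that structured streaming would otherwise handle can be approximated by hand when using incremental queries. The following is only a minimal sketch, assuming the job stores the last consumed commit time in a small JSON file. The helper names (`load_checkpoint`, `save_checkpoint`, `latest_commit_time`), the file layout, and the `"000"` starting value are all hypothetical and not part of Hudi. It relies on Hudi commit times being timestamp strings that sort lexicographically.

```python
import json
import os
import tempfile

# Assumption: a begin instant that sorts before every real commit time.
EARLIEST = "000"


def load_checkpoint(path):
    """Return the last consumed commit time, or EARLIEST on the first run."""
    try:
        with open(path) as f:
            return json.load(f)["last_commit_time"]
    except FileNotFoundError:
        return EARLIEST


def save_checkpoint(path, commit_time):
    """Write the checkpoint to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"last_commit_time": commit_time}, f)
    os.replace(tmp, path)


def latest_commit_time(commit_times, default):
    """Pick the newest commit time seen in a batch (string comparison)."""
    return max(commit_times, default=default)
```

A job would call `load_checkpoint` to get the begin instant, run the incremental query, collect the distinct `_hoodie_commit_time` values of the rows it processed, and call `save_checkpoint` only after the write to the sink has succeeded.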



[GitHub] [hudi] bvaradar commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-668665466


   Closing this ticket in favor of the JIRA to track the feature request.



[GitHub] [hudi] vinothchandar commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-661204597


   @rubenssoto yes, we already support incremental queries using the Spark datasource. It seems like the only thing missing here is the Spark structured streaming integration? (which we can add after 0.6.0)
   https://hudi.apache.org/docs/querying_data.html#spark-incr-query
   
   https://www.youtube.com/watch?v=1w3IpavhSWA actually talks about a production use case we built using an incremental query plus some grouping on the sink side. Unlike Delta, Hudi has record-level metadata around arrival times and thus does not need anything like ignoreChanges.
   
   I am not sure if I am missing something about your use case, but it feels like you should be able to get this working incrementally end to end with what we have today (again, we can add Spark streaming read support if there are hands to help. cc @garyli1019? :))
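
For readers on PySpark, an incremental query through the Spark datasource might look roughly like the sketch below. It is not taken from this thread: the option keys are the ones documented for the Hudi Spark datasource as far as I recall (older releases used `hoodie.datasource.view.type` instead of `hoodie.datasource.query.type`), so check them against the docs for your version. The function names and the `spark` and `base_path` arguments are placeholders.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build reader options for a Hudi incremental query.

    begin_instant and end_instant are Hudi commit times such as
    "20200716164020". The end instant is optional.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def read_incremental(spark, base_path, begin_instant, end_instant=None):
    """Return a DataFrame of records committed after begin_instant.

    spark is an existing SparkSession with the Hudi bundle on its
    classpath, and base_path is the base path of the Hudi dataset.
    """
    return (
        spark.read.format("org.apache.hudi")
        .options(**incremental_read_options(begin_instant, end_instant))
        .load(base_path)
    )
```

The returned rows carry the `_hoodie_commit_time` metadata column, which is the record-level arrival information mentioned above and can be used to decide which groups on the sink side need to be recomputed.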
   

