Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/23 07:43:33 UTC

[GitHub] [hudi] xuranyang opened a new issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

xuranyang opened a new issue #4082:
URL: https://github.com/apache/hudi/issues/4082


   Is there a good way to write to multiple Hudi tables simultaneously in a single Spark Streaming job?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1003784185


   @xuranyang : Without further info, we can't do much here. Can you please let us know what exactly you are looking for?
   I am not an expert in Structured Streaming, but if you are looking to read from one stream and write to different Hudi tables based on some condition, here is one hacky way I can think of.
   
   https://gist.github.com/nsivabalan/f7ee7fa611cfc864db7506c016a73787
   
   This intercepts each micro-batch from the source stream and writes it to a Hudi table. You can build upon this by adding switch cases that write to different Hudi tables. But this would mean your writes go through the Spark datasource (batch) write path and not the streaming write path.
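
The micro-batch interception described above could look roughly like the following PySpark sketch. It is not the content of the linked gist. The routing column `event_type`, the table names, the paths, and the `id`/`ts` key fields are all hypothetical, and depending on the Hudi version further options (for example a partition path field) may be required.

```python
from typing import Callable, Dict, Tuple

# Hypothetical routing table: value of the routing column -> (Hudi table name, base path).
TABLE_ROUTES: Dict[str, Tuple[str, str]] = {
    "orders": ("orders_hudi", "/tmp/hudi/orders"),
    "users": ("users_hudi", "/tmp/hudi/users"),
}


def hudi_write_options(table_name: str,
                       record_key: str = "id",
                       precombine: str = "ts") -> Dict[str, str]:
    """Build per-table Hudi datasource options (field names are assumptions)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.operation": "upsert",
    }


def make_batch_writer(route_column: str = "event_type") -> Callable:
    """Return a function suitable for DataStreamWriter.foreachBatch."""

    def write_batch(batch_df, batch_id):
        # Imported lazily so this module loads even where pyspark is absent.
        from pyspark.sql.functions import col

        # Cache the micro-batch because it is filtered once per target table.
        batch_df.persist()
        try:
            for value, (table_name, base_path) in TABLE_ROUTES.items():
                # Plain batch (datasource) write, not a streaming sink.
                (batch_df.filter(col(route_column) == value)
                    .write.format("hudi")
                    .options(**hudi_write_options(table_name))
                    .mode("append")
                    .save(base_path))
        finally:
            batch_df.unpersist()

    return write_batch
```

It would be attached to a streaming DataFrame with something like `stream_df.writeStream.foreachBatch(make_batch_writer()).option("checkpointLocation", "/tmp/checkpoints/router").start()`, where the checkpoint path is again a placeholder.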
   
   



[GitHub] [hudi] xuranyang closed issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

xuranyang closed issue #4082:
URL: https://github.com/apache/hudi/issues/4082


   




[GitHub] [hudi] nsivabalan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1018598048


   @xushiyan : I guess the author is asking whether we can achieve what MultiTableDeltaStreamer does using Spark Streaming. I don't think we have any readily available solution for this yet.
   @xuranyang : As Raymond suggested, it would be a good addition to the Hudi feature set. Feel free to start a discussion or an RFC around this.



[GitHub] [hudi] Carl-Zhou-CN commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

Carl-Zhou-CN commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-979648476


   You can try running these writes concurrently within the same streaming job; Hudi tables are isolated from one another.
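
One way to read this suggestion is to start one Structured Streaming query per table inside a single Spark application. The PySpark sketch below assumes that setup. The Kafka topics, table names, paths, and the choice of `key`/`timestamp` as record key and precombine fields are hypothetical, and this is not a built-in Hudi feature equivalent to MultiTableDeltaStreamer.

```python
from typing import Dict, List

# Hypothetical mapping of Kafka topics to Hudi tables and base paths.
SOURCES: List[Dict[str, str]] = [
    {"topic": "orders", "table": "orders_hudi", "path": "/tmp/hudi/orders"},
    {"topic": "users", "table": "users_hudi", "path": "/tmp/hudi/users"},
]


def sink_options(source: Dict[str, str],
                 checkpoint_root: str = "/tmp/checkpoints") -> Dict[str, str]:
    """Options for one streaming sink; each query gets its own checkpoint dir."""
    return {
        "hoodie.table.name": source["table"],
        "hoodie.datasource.write.recordkey.field": "key",
        "hoodie.datasource.write.precombine.field": "timestamp",
        "checkpointLocation": f"{checkpoint_root}/{source['table']}",
    }


def start_all(spark, bootstrap_servers: str) -> list:
    """Start one independent streaming query per entry in SOURCES."""
    queries = []
    for source in SOURCES:
        stream_df = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", bootstrap_servers)
            .option("subscribe", source["topic"])
            .load()
            # Kafka delivers key/value as binary; cast them for this sketch.
            .selectExpr("CAST(key AS STRING) AS key",
                        "CAST(value AS STRING) AS value",
                        "timestamp"))
        query = (stream_df.writeStream.format("hudi")
            .options(**sink_options(source))
            .outputMode("append")
            .start(source["path"]))
        queries.append(query)
    return queries
```

After `start_all(spark, "localhost:9092")`, calling `spark.streams.awaitAnyTermination()` keeps the driver alive while the queries run. Each query writes to its own table path and checkpoint directory, so the queries do not share any Hudi timeline.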



[GitHub] [hudi] xuranyang commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

xuranyang commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1005488249


   > @xuranyang : Without further info, we can't do much here. Can you please let us know what exactly you are looking for? I am not an expert in Structured Streaming, but if you are looking to read from one stream and write to different Hudi tables based on some condition, here is one hacky way I can think of.
   > 
   > https://gist.github.com/nsivabalan/f7ee7fa611cfc864db7506c016a73787
   > 
   > This intercepts each micro-batch from the source stream and writes it to a Hudi table. You can build upon this by adding switch cases that write to different Hudi tables. But this would mean your writes go through the Spark datasource (batch) write path and not the streaming write path.
   
   @nsivabalan, thanks for your reply. I want to read from multiple streams and write to different Hudi tables,
   for example writing to Hudi from multiple Kafka sources.
   I know it can be done with DeltaStreamer, but I wonder whether Spark Streaming can do the same.
   
   



[GitHub] [hudi] xushiyan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

xushiyan commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1015092759


   > @xuranyang : Are you referring to MultiTableDeltaStreamer? I don't think we have any such functionality for now to stream from multiple sources and write to different Hudi tables. It has to be done manually at the application layer by the user. If you can build some simple framework to get this, please consider upstreaming the functionality to benefit others in the community. Thanks!
   
   +1. @xuranyang You can do this with MultiTableDeltaStreamer, or by using an orchestration tool such as Kubernetes to manage a large number of streaming jobs. If you have ideas, feel free to start an RFC or contribute to the Hudi codebase.



[GitHub] [hudi] nsivabalan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1039646588


   @xuranyang : Would you mind starting a dev thread on this? Let's see what others have to say. Feel free to close the GitHub issue, by the way, if you don't have any follow-ups.
   




[GitHub] [hudi] Carl-Zhou-CN commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

Carl-Zhou-CN commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1005481900


   @nsivabalan I am very interested in how Hudi performs writes. You said, 'But this would mean your writes are going through spark datasource write and not as streaming write.' Which path does a streaming write take?



[GitHub] [hudi] nsivabalan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-991836397


   May I know what your requirement is here? Reading from one stream and, depending on certain conditions, writing to different Hudi tables? Or reading from multiple streams and writing to different Hudi tables?




[GitHub] [hudi] nsivabalan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4082:
URL: https://github.com/apache/hudi/issues/4082#issuecomment-1008548662


   @xuranyang : Are you referring to MultiTableDeltaStreamer? I don't think we have any such functionality for now to stream from multiple sources and write to different Hudi tables. It has to be done manually at the application layer by the user.
   If you can build some simple framework to get this, please consider upstreaming the functionality to benefit others in the community.
   Thanks!

