Posted to commits@hudi.apache.org by "Pratyaksh Sharma (Jira)" <ji...@apache.org> on 2022/03/07 18:19:00 UTC

[jira] [Assigned] (HUDI-2318) Enhance and stabilize multi-table deltastreamer

     [ https://issues.apache.org/jira/browse/HUDI-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pratyaksh Sharma reassigned HUDI-2318:
--------------------------------------

    Assignee: Pratyaksh Sharma

> Enhance and stabilize multi-table deltastreamer
> -----------------------------------------------
>
>                 Key: HUDI-2318
>                 URL: https://issues.apache.org/jira/browse/HUDI-2318
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Utilities
>            Reporter: sivabalan narayanan
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>
> Currently, the multi-table deltastreamer supports only COW tables, and only in run-once mode. We need to enhance it considerably and make it usable across the different scenarios users have.
>  
> The community has asked for this. A typical use case:
> I have 1000+ tables and want to ingest all of them into Hudi efficiently. I don't want to run 1000+ deltastreamer instances, because I would have to allocate resources for each one.
>  
> Requirements
>  * Add MOR support to the multi-table deltastreamer.
>  * Add continuous mode support to the multi-table deltastreamer.
>  * Add support for syncing different tables concurrently. As of now, each table is synced serially, which may not scale to 1000+ tables. Syncing all 1000+ tables at once may not be desirable either, but a thread pool would give us a bounded level of concurrency.
>  ** Check out [https://github.com/apache/hudi/issues/2175] for ingesting into multiple Hudi tables using Spark Structured Streaming. We can also see whether it can be added as a utility.
>  
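
The thread-pool idea in the last requirement can be sketched roughly as follows. This is a minimal illustration in Python, not Hudi code: sync_table is a hypothetical placeholder for one table's ingest round, and max_workers is an assumed tuning knob that caps how many tables are synced at the same time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sync_table(table_name):
    # Hypothetical placeholder for one table's ingest round; not a Hudi API.
    return f"synced {table_name}"


def sync_all(table_names, max_workers=8, sync_fn=sync_table):
    """Sync every table, with at most max_workers syncs in flight at once."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every table; the pool queues work beyond max_workers.
        futures = {pool.submit(sync_fn, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure so one bad table does not stop the rest.
                failures[name] = exc
    return results, failures
```

With 1000+ tables and, say, max_workers=8, all tables are queued up front but only eight syncs run at any moment. Failures are collected per table so the caller can decide whether to retry or skip them.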



--
This message was sent by Atlassian Jira
(v8.20.1#820001)