Posted to dev@hudi.apache.org by Gurudatt Kulkarni <gu...@gmail.com> on 2019/10/01 07:03:13 UTC

Using Hudi to Pull multiple tables

Hi All,

I have a use case where I need to pull multiple tables (say close to 100)
into Hadoop. Do we need to schedule 100 Hudi jobs to pull these tables, or
is there a workaround where a single Hudi application pulls from multiple
Kafka topics? That would avoid creating multiple SparkSessions and the
memory overhead that comes with them.

Regards,
Gurudatt

Re: Using Hudi to Pull multiple tables

Posted by Vinoth Chandar <vi...@apache.org>.
https://issues.apache.org/jira/browse/HUDI-288 tracks this....

Re: Using Hudi to Pull multiple tables

Posted by Vinoth Chandar <vi...@apache.org>.
I think this has come up before.

+1 to the point Pratyaksh mentioned. I would like to add a few more:

- Schemas could be fetched dynamically from a registry based on the
topic/dataset name. Solvable.
- The Hudi keys, partition fields and the other inputs you need for
configuring Hudi need to be standardized. Solvable using dataset-level
overrides.
- You will get one RDD from Kafka with data for multiple topics, and it now
needs to be forked into multiple datasets. We need to cache the Kafka RDD in
memory, otherwise we will recompute and re-read the input from Kafka every
time. Expensive, but solvable (see the sketch after this list).
- Finally, you will be writing different Parquet schemas to different files
and, if you are running with num_core > 2, also concurrently. At Uber we
originally did that, and it became an operational nightmare to isolate bad
topics from the good ones. Pretty tricky!
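
To make the caching/forking point concrete, here is a rough Java sketch of
reading one multi-topic batch, caching it once, and splitting it per topic.
None of these names are real Hudi APIs; KafkaRecord, SchemaRegistryClient
and writeToHudiDataset are placeholders for illustration only.

// A rough sketch of the "cache once, fork per topic" idea. This is NOT
// DeltaStreamer code; KafkaRecord, SchemaRegistryClient and
// writeToHudiDataset are made-up placeholders for illustration only.
import java.io.Serializable;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public class MultiTopicForkSketch {

  // Minimal stand-in for a record read from Kafka that remembers its topic.
  public static class KafkaRecord implements Serializable {
    public String topic;
    public byte[] value;
  }

  // Hypothetical schema registry lookup keyed by topic/dataset name.
  public interface SchemaRegistryClient extends Serializable {
    String fetchLatestSchema(String topicOrDataset);
  }

  public static void forkAndWrite(JavaRDD<KafkaRecord> kafkaRdd,
                                  List<String> topics,
                                  SchemaRegistryClient registry) {
    // Cache once; otherwise every per-topic filter below would re-read
    // (and recompute) the whole input from Kafka.
    kafkaRdd.persist(StorageLevel.MEMORY_AND_DISK());

    for (String topic : topics) {
      // Fork the multi-topic RDD into one logical dataset per topic.
      JavaRDD<byte[]> topicRdd =
          kafkaRdd.filter(r -> r.topic.equals(topic)).map(r -> r.value);

      // Fetch the target schema dynamically based on the topic name.
      String avroSchema = registry.fetchLatestSchema(topic);

      // Hand the slice off to a per-dataset Hudi write (details elided).
      writeToHudiDataset(topic, avroSchema, topicRdd);
    }
    kafkaRdd.unpersist();
  }

  private static void writeToHudiDataset(String dataset, String schema,
                                         JavaRDD<byte[]> records) {
    // Placeholder: a real version would apply dataset-level overrides for
    // record key / partition fields and invoke the Hudi write path here.
  }
}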

All in all, we could support this, as long as we call out these caveats clearly.

In terms of work,

- We can introduce multi-source support into DeltaStreamer natively (this
needs more involved design work to specify how each input stream maps to
each output stream).
- (Or) we can write a new tool that wraps the current DeltaStreamer: it uses
a Kafka topic regex to identify all the topics that need to be ingested, and
creates one DeltaStreamer per topic within a SINGLE Spark application (a
rough sketch follows below).
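
For the second option, a very rough sketch of such a wrapper is below. The
topic discovery and the per-topic runner are hypothetical stand-ins, not the
actual HoodieDeltaStreamer API; a real version would build per-topic configs
(target base path, key/partition fields, schema provider) for each topic.

// Sketch only: discoverTopics() and runDeltaStreamerForTopic() are
// hypothetical stand-ins. A real version would list topics via the Kafka
// AdminClient and run one DeltaStreamer per topic inside this one app.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MultiTopicDeltaStreamerWrapper {

  public static void main(String[] args) throws InterruptedException {
    // e.g. "mydb\\..*" to pick up every topic for one upstream database.
    Pattern topicRegex = Pattern.compile(args[0]);

    List<String> topics = discoverTopics().stream()
        .filter(t -> topicRegex.matcher(t).matches())
        .collect(Collectors.toList());

    // One ingest per topic, all inside a SINGLE Spark application. A small
    // pool bounds concurrency so one bad topic cannot starve the rest.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (String topic : topics) {
      pool.submit(() -> runDeltaStreamerForTopic(topic));
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
  }

  private static List<String> discoverTopics() {
    // Hypothetical: fetch the topic list from the Kafka cluster.
    throw new UnsupportedOperationException("sketch only");
  }

  private static void runDeltaStreamerForTopic(String topic) {
    // Hypothetical: construct per-topic configs and run one DeltaStreamer
    // sync round (or loop) for this topic.
  }
}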


Any takers for this? It should be a pretty cool project, doable in a week or
two.

/thanks/vinoth

Re: Using Hudi to Pull multiple tables

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Gurudatt,

With a minimal code change, you can subscribe to multiple Kafka topics
using the KafkaOffsetGen.java class. I feel the bigger problem in this case
is going to be managing multiple target schemas, because we register the
ParquetWriter with a single target schema at a time. I would also like to
know if we have a workaround for such a case.
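
To make the first part concrete, this is roughly what regex/multi-topic
subscription looks like with the plain kafka-clients consumer. It is not
KafkaOffsetGen itself, only an illustration of the consumer-level behaviour
it would need to expose; the per-topic routing hinted at in the comments is
where the schema problem shows up.

// Plain kafka-clients illustration of pattern-based subscription; this is
// not KafkaOffsetGen, just the underlying multi-topic consumer behaviour.
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiTopicConsumerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "hudi-multi-topic-demo");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      // Subscribe to every topic matching a regex instead of a single name.
      consumer.subscribe(Pattern.compile("mydb\\..*"));

      // The first poll may return nothing while the group rebalances.
      ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<byte[], byte[]> record : records) {
        // record.topic() tells us which target table/schema this row belongs
        // to; that routing (and the per-topic ParquetWriter schema) is the
        // hard part on the Hudi side.
        System.out.println(record.topic() + " -> " + record.value().length + " bytes");
      }
    }
  }
}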
