Posted to dev@hudi.apache.org by songj songj <so...@gmail.com> on 2020/12/01 08:28:43 UTC

why not use spark datasource in DeltaStreamer

Hi, I have some questions:

1. DeltaStreamer has its own Source<JavaRDD<String>> to consume source
data, such as Kafka. Why not use a Spark datasource directly?

2. Hudi has a lot of logic that uses RDDs. Why not use Spark DataFrames?

I just want to know the background of the above implementation, thanks!

Re: why not use spark datasource in DeltaStreamer

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Regarding RDD vs. DataFrame, the historical reason is that RDDs provided more control through the low-level API that Hudi needs to manage various aspects of writing.
On a related note, if you look at the current approach with Flink support, the input batch is being parameterized to support different processing engines.
On Tuesday, December 1, 2020, 02:08:05 AM PST, songj songj <so...@gmail.com> wrote:

> Thanks for the reply!
> Could you help explain my two questions above?


Re: Re: why not use spark datasource in DeltaStreamer

Posted by songj songj <so...@gmail.com>.
Thanks for the reply!
Could you help explain my two questions above?

Trevor <wo...@gmail.com> wrote on Tuesday, December 1, 2020 at 5:17 PM:

> Hi songj,
>
> DeltaStreamer can be understood as a packaged Spark DataSource. You only
> need to set the required parameters, which makes it more convenient for
> data ingestion.
>
> Best,
>
> Trevor
>
>
> wowtuanzi@gmail.com

Re: Re: why not use spark datasource in DeltaStreamer

Posted by Trevor <wo...@gmail.com>.
Hi songj,

DeltaStreamer can be understood as a packaged Spark DataSource. You only need to set the required parameters, which makes it more convenient for data ingestion.

Best,

Trevor


wowtuanzi@gmail.com
 
From: songj songj
Date: 2020-12-01 16:48
To: dev
Subject: Re: why not use spark datasource in DeltaStreamer
Spark Structured Streaming consumes Kafka using the Kafka data source, and
uses foreachBatch to do insert/upsert/... to Hudi.
Is that similar to DeltaStreamer?
 

Re: why not use spark datasource in DeltaStreamer

Posted by songj songj <so...@gmail.com>.
Spark Structured Streaming consumes Kafka using the Kafka data source, and
uses foreachBatch to do insert/upsert/... to Hudi.
Is that similar to DeltaStreamer?

songj songj <so...@gmail.com> wrote on Tuesday, December 1, 2020 at 4:28 PM:

> Hi, I have some questions:
>
> 1. DeltaStreamer has its own Source<JavaRDD<String>> to consume source
> data, such as Kafka. Why not use a Spark datasource directly?
>
> 2. Hudi has a lot of logic that uses RDDs. Why not use Spark DataFrames?
>
> I just want to know the background of the above implementation, thanks!
>
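
For readers following the thread, the Structured Streaming approach songj describes (Kafka data source plus foreachBatch writing to Hudi) might look roughly like the PySpark sketch below. It is only an illustration of that pattern, not a description of how DeltaStreamer itself is implemented. The topic name, broker address, paths, table name, and the column names key, ts, and dt are all hypothetical, and the sketch assumes every Kafka message carries a non-null key and that the Hudi and Kafka Spark bundles are on the classpath.

```python
# Sketch only: table name, paths, topic and column names are hypothetical.

def hudi_write_options(table_name, record_key, precombine_field, partition_field):
    """Build the option map handed to the Hudi Spark datasource writer."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


HUDI_OPTIONS = hudi_write_options("events", "key", "ts", "dt")
TABLE_PATH = "/tmp/hudi/events"              # hypothetical table location
CHECKPOINT_PATH = "/tmp/checkpoints/events"  # hypothetical checkpoint dir


def upsert_batch(batch_df, batch_id):
    """Called by Spark once per micro-batch with a regular (non-streaming) DataFrame."""
    (batch_df.write
        .format("hudi")
        .options(**HUDI_OPTIONS)
        .mode("append")
        .save(TABLE_PATH))


def start_stream(spark, bootstrap_servers="localhost:9092", topic="events"):
    """Read a Kafka topic as a stream and upsert each micro-batch into Hudi."""
    source = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load())

    # Kafka delivers key/value as binary; cast them and derive the
    # precombine (ts) and partition (dt) columns from the Kafka timestamp.
    records = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "CAST(timestamp AS LONG) AS ts",
        "CAST(CAST(timestamp AS DATE) AS STRING) AS dt",
    )

    return (records.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start())
```

With an existing SparkSession named spark, calling start_stream(spark).awaitTermination() would run the query until it is stopped. In this sketch, offset tracking is left to the Structured Streaming checkpoint directory.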