Posted to dev@hudi.apache.org by Gary Li <ya...@gmail.com> on 2020/04/21 21:43:28 UTC

[Discussion] Abstraction for HoodieInputFormat and RecordReader

Hi Folks,

I’d like to bring up a discussion regarding better read support for Hudi datasets.

At this point, the Hudi MOR table depends on Hive's MapredParquetInputFormat, built on the first-generation Hadoop MapReduce APIs (org.apache.mapred.xxx), while the Spark DataSource uses the second generation (org.apache.mapreduce.xxx). Combining the two sets of APIs leads to unexpected behaviors, so we need to decouple the Hudi-related logic from the MapReduce APIs and provide two separate levels of support, one for V1 and one for V2.
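
To make that concrete, here is a minimal sketch of the decoupling (the names HoodieSnapshotListing, HoodieInputFormatV1, and HoodieInputFormatV2 are illustrative, not the design in the RFC): the Hudi logic lives in a core class with no MapReduce types in its signatures, and each API generation gets a thin adapter.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.NullWritable;

    // Engine-agnostic core: Hudi logic with no mapred/mapreduce types in its signatures.
    class HoodieSnapshotListing {
      static List<Path> latestFileSlices(Configuration conf, Path tablePath) {
        // ... resolve the latest file slice per file group from the timeline ...
        throw new UnsupportedOperationException("sketch only");
      }
    }

    // Thin V1 adapter over the first-generation API (org.apache.hadoop.mapred.*).
    class HoodieInputFormatV1
        extends org.apache.hadoop.mapred.FileInputFormat<NullWritable, ArrayWritable> {
      @Override
      public org.apache.hadoop.mapred.RecordReader<NullWritable, ArrayWritable> getRecordReader(
          org.apache.hadoop.mapred.InputSplit split,
          org.apache.hadoop.mapred.JobConf job,
          org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        // Delegates slice resolution to the shared core, then reads.
        throw new UnsupportedOperationException("sketch only");
      }
    }

    // Thin V2 adapter over the second-generation API (org.apache.hadoop.mapreduce.*).
    class HoodieInputFormatV2
        extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<NullWritable, ArrayWritable> {
      @Override
      public org.apache.hadoop.mapreduce.RecordReader<NullWritable, ArrayWritable> createRecordReader(
          org.apache.hadoop.mapreduce.InputSplit split,
          org.apache.hadoop.mapreduce.TaskAttemptContext context) {
        // Same shared core, different API surface.
        throw new UnsupportedOperationException("sketch only");
      }
    }

The adapters stay small, and any fix to the slice-listing logic benefits both API generations.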

With native Spark DataSource support, we can take advantage of better query performance, more Spark optimizations such as the VectorizedReader, and support for more storage formats such as ORC. Abstracting the InputFormat and RecordReader will also open the door to supporting more query engines in the future.
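
For context, this is roughly what the read path looks like with a native DataSource (a minimal sketch; the table path is made up, and we assume Hudi's DataSource is registered under "org.apache.hudi"):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HudiSnapshotReadExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-snapshot-read")
            .master("local[*]")
            .getOrCreate();

        // With a native DataSource, Spark plans the scan itself and can keep
        // optimizations such as the vectorized Parquet reader.
        Dataset<Row> df = spark.read()
            .format("org.apache.hudi")   // Hudi's Spark DataSource
            .load("/tmp/hudi/trips");    // illustrative table path

        df.show();
        spark.stop();
      }
    }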

I created an RFC with more details: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstract+HoodieInputFormat+and+RecordReader. Any feedback is appreciated.

Best Regards,
Gary Li


Re: [Discussion] Abstraction for HoodieInputFormat and RecordReader

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Gary,

On COW, we already let the engines (Spark, Hive, Presto) use their own
readers for Parquet.

But as we embark on the MOR snapshot query (aka the realtime InputFormat),
it may make sense to have abstractions in our own codebase to efficiently
read base + log files (a file slice) out in different common formats:
ArrayWritable for Hive, Row for Spark, and so on.
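
A hypothetical sketch of what such an abstraction could look like
(FileSliceReader and RecordConverter are illustrative names, not existing
Hudi classes):

    import java.io.Closeable;
    import java.util.Iterator;

    import org.apache.avro.generic.GenericRecord;

    // Converts a merged (base + log) record into an engine's row type,
    // e.g. ArrayWritable for Hive or Row for Spark.
    interface RecordConverter<T> {
      T convert(GenericRecord mergedRecord);
    }

    // One reader per file slice: merge once, convert per engine.
    final class FileSliceReader<T> implements Closeable {
      private final Iterator<GenericRecord> merged;   // base records merged with log updates
      private final RecordConverter<T> converter;

      FileSliceReader(Iterator<GenericRecord> merged, RecordConverter<T> converter) {
        this.merged = merged;
        this.converter = converter;
      }

      public boolean hasNext() { return merged.hasNext(); }
      public T next() { return converter.convert(merged.next()); }

      @Override
      public void close() { /* release file handles in a real implementation */ }
    }

The merge logic then lives in one place, and each engine only supplies a
converter for its own row type.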

I will review the RFC more closely. But at a high level, I am +1 on evolving
our FileSlice abstraction more cleanly (HUDI-684's goal is precisely that)
and plugging it in flexibly under Spark. We need not be encumbered by the
mapreduce/InputFormat APIs per se.

Thanks
Vinoth


Re: [Discussion] Abstraction for HoodieInputFormat and RecordReader

Posted by leesf <le...@gmail.com>.
Hi Gary,
Thanks for your proposal. I have read the input format codebase before and
thought it could be optimized. I will take a look at the design doc when I
get a chance and give feedback.
