Posted to dev@hudi.apache.org by leesf <le...@gmail.com> on 2021/11/09 15:54:47 UTC

[DISCUSS] Move to Spark DataSource V2 API

Hi all,

I did see the community discuss moving to the V2 datasource API before
[1], but there was no further progress, so I want to bring up the
discussion again. Hudi still uses the V1 API and relies heavily on the
RDD API to index, repartition and so on, given the flexibility of RDDs.
The V2 API, however, eliminates RDD usage, introduces the CatalogPlugin
mechanism for managing Hudi tables, and defines entirely new writing and
reading interfaces. This poses some challenges, since Hudi uses RDDs in
both the writing and reading paths. Still, I think it is necessary to
integrate Hudi with the V2 API: the V1 API is quite old, and V2
optimizations such as more pushdown filters on the query side can
accelerate queries when combined with RFC-27 [2].
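
To make the query-side point concrete, here is a minimal, hypothetical
sketch (Scala; the class name is invented, not Hudi's actual reader code)
of how a V2 ScanBuilder receives pushed-down filters, which is the hook a
data skipping index like RFC-27 could use to prune files before the scan:

    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
    import org.apache.spark.sql.sources.Filter

    // Hypothetical sketch only: shows where V2 hands filters to the source.
    class HudiScanBuilderSketch extends ScanBuilder with SupportsPushDownFilters {
      private var pushed: Array[Filter] = Array.empty

      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        // Keep all filters for file/row-group pruning; return them so Spark
        // still re-evaluates them after the scan for correctness.
        pushed = filters
        filters
      }

      override def pushedFilters(): Array[Filter] = pushed

      override def build(): Scan = {
        // A real implementation would consult a column-stats / data skipping
        // index here to plan a pruned file listing.
        ???
      }
    }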

Here is the work I think we should do when moving to the V2 API.

1. Integrate with the V2 writing interface (the bulk_insert row path is
already implemented, but upsert/insert operations still fall back to the
V1 writing code path); see the interface sketch after this list
2. Integrate with the V2 reading interface
3. Introduce CatalogPlugin to manage Hudi tables
4. Fully use the V2 writing interface (use Iterator<InternalRow>, which
may require some refactoring of HoodieSparkWriteClient to keep
precombining, indexing, etc. working)
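
For reference, below is a rough skeleton (Scala; class names invented for
illustration, not taken from the Hudi codebase) of the Spark 3 V2 writing
interfaces that items 1 and 4 would sit on. The key constraint is visible
in DataWriter: each task only sees a per-partition stream of InternalRow,
with no RDD handle for global repartitioning or indexing, which is why
upsert is the hard case.

    import java.util
    import scala.collection.JavaConverters._

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability}
    import org.apache.spark.sql.connector.write._
    import org.apache.spark.sql.types.StructType

    // Hypothetical skeleton of a V2 batch-writable table.
    class HudiV2TableSketch(tableSchema: StructType) extends Table with SupportsWrite {
      override def name(): String = "hudi_v2_sketch"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        Set(TableCapability.BATCH_WRITE).asJava

      override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
        new WriteBuilder {
          override def buildForBatch(): BatchWrite = new BatchWrite {
            override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
              new DataWriterFactory {
                override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
                  new DataWriter[InternalRow] {
                    override def write(row: InternalRow): Unit = {
                      // bulk_insert-style row path: append the row to a data file
                    }
                    override def commit(): WriterCommitMessage = new WriterCommitMessage {}
                    override def abort(): Unit = {}
                    override def close(): Unit = {}
                  }
              }
            override def commit(messages: Array[WriterCommitMessage]): Unit = {
              // driver side: finish the Hudi commit on the timeline
            }
            override def abort(messages: Array[WriterCommitMessage]): Unit = {}
          }
        }
    }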

Please add any other work not mentioned above; I would love to hear
other opinions and feedback from the community. There is already an
umbrella ticket to track datasource V2 [3], and I will put up an RFC
with more details. You can also join the #spark-datasource-v2 channel in
the Hudi Slack for more discussion.

[1]
https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
[2]
https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
[3] https://issues.apache.org/jira/browse/HUDI-1297



Thanks
Leesf

Re: [DISCUSS] Move to Spark DataSource V2 API

Posted by Vinoth Chandar <vi...@apache.org>.
Hi all,

Sorry. Bit late to the party here. +1 on kicking this off and +1 on reusing
the work Raymond has already kickstarted here.
I think we are in a good position to roll with this approach.

The biggest issue with V2 on the writing side remains the fact that we
cannot really "shuffle" data for precombine or indexing after we issue
spark.write.format(..).

I propose the following approach (see the usage sketch after this list):

- Instead of changing the existing V1 datasource, we make a new
"*hudi-v2*" datasource.
- Writes in "hudi-v2" just support bulk_insert or overwrite, like what
df.write.parquet does.
- We introduce a new `SparkDataSetWriteClient` or equivalent that offers
programmatic access to all the rich APIs we have currently, so users can
use it for delete, insert, update, index, etc.
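
To illustrate the split, usage could look roughly like the sketch below
(Scala). Note that "hudi-v2" and SparkDataSetWriteClient are the names
proposed in this thread, not existing Hudi APIs:

    import org.apache.spark.sql.SparkSession

    object HudiV2UsageSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hudi-v2-sketch").getOrCreate()
        val df = spark.read.json("/tmp/input")          // any input data
        val basePath = "/tmp/hudi/my_table"

        // Append-style writes go through the V2 datasource, like parquet writes:
        df.write
          .format("hudi-v2")                            // proposed format name
          .option("hoodie.table.name", "my_table")
          .mode("append")
          .save(basePath)

        // Mutations keep full Hudi semantics via a programmatic client
        // (proposed class, does not exist today):
        // val client = new SparkDataSetWriteClient(spark, writeConfig)
        // client.upsert(updatesDf, instantTime)
        // client.delete(deleteKeysDf, instantTime)
      }
    }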

I think this approach is really flexible. Users can keep using V1 for
streaming updates if they prefer that. It also offers a migration path
from V1 to V2 that does not involve breaking existing pipelines.

I see that there is an RFC open. I will get to it once my release
blockers are in better shape.

Thanks
Vinoth





Re: [DISCUSS] Move to Spark DataSource V2 API

Posted by 受春柏 <sc...@126.com>.
Hi leesf,
Supporting Spark datasource V2 has always been something we wanted to
do, and we can get involved.

Re: [DISCUSS] Move to Spark DataSource V2 API

Posted by leesf <le...@gmail.com>.
Thanks Raymond for sharing the work that has been done. I agree that
the 1st approach would need more work and time to adapt fully to the V2
interfaces and would differ across engines. Thus, abstracting the Hudi
core writing/reading framework to adapt to different engines (approach
2) looks good to me at the moment, since we would not need extra work
to adapt to other engines and could focus on the Spark writing/reading
side.


Re: [DISCUSS] Move to Spark DataSource V2 API

Posted by Raymond Xu <xu...@gmail.com>.
Great initiative and idea, Leesf.

Totally agreed on the benefits of adopting the V2 APIs. On the 4th
point, "fully use the V2 writing interface":

I have previously worked on implementing upsert with the V2 writing
interface, using SimpleIndex with a broadcast join. The PoC worked,
though without fully integrating with the other table services. The
downside of going this route would be re-implementing most of the logic
we have today in the RDD writer path, including the different indexing
implementations, which are non-trivial.
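
For readers unfamiliar with that route, here is a minimal, hypothetical
Dataset-level sketch (Scala; column names invented, not the actual PoC
code) of tagging incoming records against existing keys with a broadcast
join:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.broadcast

    // Hypothetical sketch of SimpleIndex-style tagging via broadcast join:
    // existingKeyToFile has columns (record_key, file_id) and is small enough
    // to broadcast; rows that join get the file group to update, rows with a
    // null file_id are inserts.
    def tagRecords(incoming: DataFrame, existingKeyToFile: DataFrame): DataFrame =
      incoming.join(broadcast(existingKeyToFile), Seq("record_key"), "left_outer")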

Another route I've PoC'ed is to treat the current RDD writer path as a
Hudi "writer framework": an input Dataset<Row> goes through the same
components we see today, i.e. Client -> specific ActionExecutor ->
Helper -> (dedup/indexing/tagging/build profile) -> Base Write
ActionExecutor -> (map partitions and perform the write on a Row
iterator via a parquet writer/reader) -> return Dataset<WriterStatus>.
A rough sketch of that final write stage follows below.
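
As a rough illustration of that last stage (Scala; names like
WriterStatusSketch are invented for this sketch, not the real WriteStatus
types), the Row-iterator write step could look like:

    import org.apache.spark.TaskContext
    import org.apache.spark.sql.{Dataset, Encoder, Encoders, Row}

    // Invented stand-in for WriteStatus in this sketch.
    case class WriterStatusSketch(partitionId: Int, recordsWritten: Long)

    // Final stage of the "writer framework" sketch: after dedup/indexing/
    // profiling have shaped the Dataset[Row], each partition's Row iterator
    // is handed to a file writer and a status row is returned.
    def baseWrite(tagged: Dataset[Row]): Dataset[WriterStatusSketch] = {
      implicit val enc: Encoder[WriterStatusSketch] = Encoders.product[WriterStatusSketch]
      tagged.mapPartitions { rows =>
        var count = 0L
        rows.foreach { _ => count += 1 }  // a real writer would append each Row to a parquet/log file
        Iterator(WriterStatusSketch(TaskContext.getPartitionId(), count))
      }
    }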

As you can see, the 1st approach adopts an engine-native framework (the
V2 writing interface in this case) to realize Hudi operations, while the
2nd approach adopts the Hudi "writer framework" and uses engine-native
data-level APIs to realize Hudi operations. The 2nd approach gives
better flexibility in adopting different engines: it leverages each
engine's capabilities to manipulate data while ensuring write operations
are realized in the "Hudi" way. The prerequisite is a flexible Hudi
abstraction on top of the different engines' data-level APIs. Ethan has
landed 2 major abstraction PRs to pave the way for it, which will enable
a great deal of code reuse.
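
A tiny, hypothetical sketch (Scala; not the actual abstractions those
PRs introduce) of what such an engine-level data handle might look like:

    // Hypothetical: an engine-agnostic data handle the "writer framework"
    // could be written against. Spark would back it with Dataset/RDD, other
    // engines with their own distributed collections.
    trait EngineData[T] {
      def map[U](f: T => U): EngineData[U]
      def mapPartitions[U](f: Iterator[T] => Iterator[U]): EngineData[U]
      def collectAsList(): java.util.List[T]
    }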

The Hudi "writer framework" today consists of a bunch of Java classes.
It can be formalized and refactored along the way while implementing Row
writing. Once the "framework" is formalized, its flexibility can really
shine when bringing new processing engines into Hudi. Something similar
could be done on the reader path too, I suppose.
