Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2021/07/15 04:13:33 UTC

[DISCUSS] Move to spark v2 datasource API

Folks,

As you may know, we still use the V1 API, given the flexibility it offers
to further transform the dataframe after one calls `df.write.format()`,
which lets us implement a fully featured write pipeline with precombining,
indexing and custom partitioning. The V2 API takes this away and instead
provides a very restrictive, partition-level write interface that hands us
an Iterator<InternalRow>.
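
For reference, below is a rough sketch (not Hudi code, just an
illustration of the shape of the API) of the per-partition write surface
that DataSource V2 exposes in Spark 3; the Example* class names are made
up:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, DataWriterFactory, WriterCommitMessage}

// Each task only sees its own rows, one at a time, so table-wide steps
// like precombining or index lookups have no natural place here.
class ExampleWriterFactory extends DataWriterFactory {
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new ExampleDataWriter(partitionId, taskId)
}

class ExampleDataWriter(partitionId: Int, taskId: Long) extends DataWriter[InternalRow] {
  override def write(row: InternalRow): Unit = { /* buffer or write the row */ }
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = {}
  override def close(): Unit = {}
}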

That said, v2 has evolved (again) in Spark 3, and we would like to move to
the V2 APIs at some point, for both querying and writing. This thread
summarizes a few approaches we can take.

*Option 1 : Introduce a pre write hook in Spark Datasource API*
If the datasource API provided a simple way to further transform the
dataframe after the call to df.write.format() is made, we would be able to
move much of our HoodieSparkSQLWriter logic into that hook and make the
transition.
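
To make it concrete, something like the hypothetical hook below (no such
API exists in Spark today; it is only meant to illustrate the ask) is what
Option 1 amounts to:

import org.apache.spark.sql.DataFrame

// Hypothetical pre-write hook: invoked after df.write.format("hudi").save()
// is called, but before rows are handed to the partition-level writers, so
// the source could still repartition, sort, precombine and index the input.
trait PreWriteHook {
  def transform(df: DataFrame): DataFrame
}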

Sivabalan engaged with the Spark community around this, without luck.
Anyone who can help revive this or make a more successful attempt, please
chime in.

*Option 2 : Introduce a new datasource hudi_v2 for Spark Datasource +
HoodieSparkClient API*

We would limit datasource writes to simply bulk_inserts or
insert_overwrites. All other write operations would be supported via a new
HoodieSparkClient API (similar to all the write clients we have, but
working with Dataset<Row>). Queries will be supported on the v2 APIs. This
will be done only for Spark 3.
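
Roughly, usage under this option could look like the sketch below
(HoodieSparkClient and its methods are only a proposal at this point; the
paths and table location are placeholders):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val input: Dataset[Row] = spark.read.parquet("/tmp/source")

// Simple bulk-style operations stay on the datasource path.
input.write.format("hudi_v2")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode("append")
  .save("/tmp/hudi/trips")

// Upserts, deletes, etc. would go through the new row-based client instead
// (hypothetical API, mirroring the existing write clients):
// val client = new HoodieSparkClient(spark, writeConfig)
// client.upsert(input, instantTime)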

We would still keep the current v1 datasource for as long as Spark
supports it. Obviously, users would have to migrate pipelines to hudi_v2
at some point, if datasource v1 support is dropped.

My concern is that having two datasources could cause greater confusion
for users.

Maybe there are other options that I did not list out here. Please add
them.

Thanks
Vinoth

Re: [DISCUSS] Move to spark v2 datasource API

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Siva,

Regarding the ability to specify distribution and sorting, can they be
dynamic, and not just fixed at table creation time?
Hudi is really a storage system, i.e. it has a specific layout of data,
with multiple tables (ro, rt) exposed.
So all of these "file" management APIs tend to fit poorly at times.
To your point, we still need to support indexing and transformations.

>>from within our V1's createRelation can't we bypass and start using
hudi_v2 somehow?

That would mean we still expose the V1 datasource APIs to users, right?
Which is what we want to avoid in the first place?

Thanks
Vinoth

On Thu, Jul 15, 2021 at 8:10 PM Sivabalan <n....@gmail.com> wrote:

> I don't have much knowledge wrt catalog, but is there an option of
> exploring spark catalog based table to create a hudi table? I do know with
> spark3.2, you can add Distribution(a.k.a partitioning) and Sort order to
> your table. But still not sure on custom transformation for indexing, etc.
>
> Also, wrt Option2, is there a way to not explicitly ask users to start
> using hudi_V2 once we have one. For eg, from within our V1's createRelation
> can't we bypass and start using hudi_v2 somehow? Directly using hudi_v2
> should also be an option. I need to explore more on these lines, but just
> putting it out.
>
> Once we make some headway in this(by some spark expertise), I can
> definitely contribute from my side on this project.
>
>
> On Thu, Jul 15, 2021 at 12:13 AM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Folks,
> >
> > As you may know, we still use the V1 API, given it the flexibility
> further
> > transform the dataframe, after one calls `df.write.format()`, to
> implement
> > a fully featured write pipeline with precombining, indexing, custom
> > partitioning. V2 API takes this away and rather provides a very
> restrictive
> > API that simply provides a partition level write interface that hands a
> > Iterator<InternalRow>.
> >
> > That said, v2 has evolved (again) in Spark 3 and we would like to get
> with
> > the V2 APIs at some point, for both querying and writing. This thread
> > summarizes a few approaches we can take.
> >
> > *Option 1 : Introduce a pre write hook in Spark Datasource API*
> > If datasource API provided a simple way to further transform dataframes,
> > after the call to df.write.format is done, we would be able to move much
> of
> > our HoodieSparkSQLWriter logic into that and make the transition.
> >
> > Sivabalan engaged with the Spark community around this, without luck.
> > Anyone who can help revive this or make a more successful attempt, please
> > chime in.
> >
> > *Option 2 : Introduce a new datasource hudi_v2 for Spark Datasource +
> > HoodieSparkClient API*
> >
> > We would limit datasource writes to simply bulk_inserts or
> > insert_overwrites. All other write operations would be supported via a
> new
> > HoodieSparkClient API (similar to all the write clients we have, but
> works
> > with DataSet<Row>). Queries will be supported on the v2 APIs. This will
> be
> > done only for Spark 3.
> >
> > We would still keep the current v1 support until Spark supports it.
> > Obviously, users have to migrate pipelines to hudi_v2 at some point, if
> > datasource v1 support is dropped
> >
> > My concern is having two datasources, causing greater confusion for the
> > users.
> >
> > Maybe there are others that I did not list out here. Please add
> >
> > Thanks
> > Vinoth
> >
>
>
> --
> Regards,
> -Sivabalan
>

Re: [DISCUSS] Move to spark v2 datasource API

Posted by Sivabalan <n....@gmail.com>.
I don't have much knowledge w.r.t. the catalog, but is there an option of
exploring a Spark catalog based table to create a Hudi table? I do know
that with Spark 3.2, you can add a Distribution (a.k.a. partitioning) and
Sort order to your table. But I am still not sure about custom
transformations for indexing, etc.
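
For context, this is roughly the Spark 3.2 write-side hook I am referring
to (abbreviated sketch only; the column names are made up):

import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering

// A V2 Write can ask Spark to cluster incoming rows on some columns and
// sort them within each task before the writers are invoked.
class ExampleWrite extends RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.column("partition_path")))

  override def requiredOrdering(): Array[SortOrder] =
    Array(Expressions.sort(Expressions.column("record_key"), SortDirection.ASCENDING))
}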

Also, w.r.t. Option 2, is there a way to not explicitly ask users to start
using hudi_v2 once we have one? For example, from within our V1's
createRelation, can't we bypass it and start using hudi_v2 somehow?
Directly using hudi_v2 should also be an option. I need to explore more
along these lines, but just putting it out there.

Once we make some headway on this (with some Spark expertise), I can
definitely contribute from my side on this project.


On Thu, Jul 15, 2021 at 12:13 AM Vinoth Chandar <vi...@apache.org> wrote:

> Folks,
>
> As you may know, we still use the V1 API, given it the flexibility further
> transform the dataframe, after one calls `df.write.format()`, to implement
> a fully featured write pipeline with precombining, indexing, custom
> partitioning. V2 API takes this away and rather provides a very restrictive
> API that simply provides a partition level write interface that hands a
> Iterator<InternalRow>.
>
> That said, v2 has evolved (again) in Spark 3 and we would like to get with
> the V2 APIs at some point, for both querying and writing. This thread
> summarizes a few approaches we can take.
>
> *Option 1 : Introduce a pre write hook in Spark Datasource API*
> If datasource API provided a simple way to further transform dataframes,
> after the call to df.write.format is done, we would be able to move much of
> our HoodieSparkSQLWriter logic into that and make the transition.
>
> Sivabalan engaged with the Spark community around this, without luck.
> Anyone who can help revive this or make a more successful attempt, please
> chime in.
>
> *Option 2 : Introduce a new datasource hudi_v2 for Spark Datasource +
> HoodieSparkClient API*
>
> We would limit datasource writes to simply bulk_inserts or
> insert_overwrites. All other write operations would be supported via a new
> HoodieSparkClient API (similar to all the write clients we have, but works
> with DataSet<Row>). Queries will be supported on the v2 APIs. This will be
> done only for Spark 3.
>
> We would still keep the current v1 support until Spark supports it.
> Obviously, users have to migrate pipelines to hudi_v2 at some point, if
> datasource v1 support is dropped
>
> My concern is having two datasources, causing greater confusion for the
> users.
>
> Maybe there are others that I did not list out here. Please add
>
> Thanks
> Vinoth
>


-- 
Regards,
-Sivabalan
