Posted to dev@hudi.apache.org by sagar sumit <sa...@gmail.com> on 2021/11/05 03:58:23 UTC

Re: [DISCUSS] Trino Plugin for Hudi

I have created an umbrella JIRA to track this story:
https://issues.apache.org/jira/browse/HUDI-2687
Please also join #trino-hudi-connector channel in Hudi Slack for more
discussion.

Regards,
Sagar

On Thu, Oct 21, 2021 at 5:38 PM sagar sumit <sa...@gmail.com> wrote:

> This patch adds support for snapshot queries on MOR tables:
> https://github.com/trinodb/trino/pull/9641
> It works with the existing Hive connector.
>
> Right now, I have only prototyped snapshot queries on COW tables with the
> new Hudi connector in https://github.com/codope/trino/tree/hudi-plugin
> I will be working on supporting MOR tables as well.
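For anyone trying that branch: a Trino connector is normally enabled through a catalog properties file. A minimal sketch follows, assuming the prototype registers its plugin under the name `hudi`; the metastore property mirrors the Hive connector's and both values here are illustrative assumptions, not part of the branch itself:

```properties
# etc/catalog/hudi.properties -- hypothetical catalog file for the prototype
connector.name=hudi
# table discovery via the existing Hive metastore (assumed property name)
hive.metastore.uri=thrift://example-metastore:9083
```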
>
> Regards,
> Sagar
>
> On Wed, Oct 20, 2021 at 4:48 PM Jian Feng <ji...@shopee.com> wrote:
>
>> When will Trino support snapshot queries on Merge-on-Read tables?
>>
>> On Mon, Oct 18, 2021 at 9:06 PM 周康 <zh...@gmail.com> wrote:
>>
>> > +1. I have sent a message on Trino Slack; I really appreciate the new
>> > Trino plugin/connector.
>> > https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200
>> >
>> > Looking forward to the RFC and more discussion.
>> >
>> > On 2021/10/17 06:06:09 sagar sumit wrote:
>> > > Dear Hudi Community,
>> > >
>> > > I would like to propose the development of a new Trino
>> plugin/connector
>> > for
>> > > Hudi.
>> > >
>> > > Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables
>> and
>> > > read-optimized queries on Merge-On-Read tables with Trino, through the
>> > > input-format-based integration in the Hive connector [1
>> > > <https://github.com/prestodb/presto/commits?author=vinothchandar>].
>> This
>> > > approach has known performance limitations with very large tables,
>> > > which have since been fixed in PrestoDB [2
>> > > <https://prestodb.io/blog/2020/08/04/prestodb-and-hudi>]. We are
>> > working on
>> > > replicating the same fixes on Trino as well [3
>> > > <https://github.com/trinodb/trino/pull/9641>].
>> > >
>> > > However, as Hudi keeps getting better, a new plugin providing access to
>> > > Hudi data and metadata will help unlock those capabilities for Trino
>> > > users: metadata-based listing, full schema evolution, and more [4
>> > > <
>> >
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
>> > >].
>> > > Moreover, a separate Hudi connector could evolve independently,
>> > > without the risk of hacking or breaking the Hive connector.
>> > >
>> > > A separate connector also falls in line with our vision [5
>> > > <
>> >
>> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
>> > >]
>> > > when we think of a standalone timeline server or a lake cache to
>> balance
>> > > the tradeoff between writing and querying. Imagine users having read
>> and
>> > > write access to data and metadata in Hudi directly through Trino.
>> > >
>> > > I did some prototyping to get snapshot queries on a Hudi COW table
>> > > working with a new plugin [6
>> > > <https://github.com/codope/trino/tree/hudi-plugin>], and I feel the
>> > > effort is worth it. The high-level approach is to implement the
>> > > connector SPI [7
>> > > <https://trino.io/docs/current/develop/connectors.html>] provided by
>> > > Trino, such as:
>> > > a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
>> > > b) HudiSplit and HudiSplitManager implement ConnectorSplit and
>> > > ConnectorSplitManager to produce logical units of data partitioning, so
>> > > that Trino can parallelize reads and writes.
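The a)/b) plan above can be sketched in code. The interfaces below are deliberately simplified stand-ins for the real Trino SPI types in io.trino.spi.connector (whose methods take session, handle, and transaction arguments), and the table and file names are purely illustrative:

```java
import java.util.List;

// Simplified stand-ins for the Trino SPI (the real interfaces live in
// io.trino.spi.connector and have richer signatures).
interface ConnectorMetadata {
    List<String> listTables(String schemaName);
}

interface ConnectorSplit {
    String getPath(); // location of the data file this split covers
}

interface ConnectorSplitManager {
    List<ConnectorSplit> getSplits(String tableName);
}

// (a) HudiMetadata answers metadata questions from Hudi's own table
// metadata rather than through Hive input formats.
class HudiMetadata implements ConnectorMetadata {
    @Override
    public List<String> listTables(String schemaName) {
        // A real implementation would consult the Hudi metadata table.
        return List.of("hudi_cow_table");
    }
}

// (b) HudiSplit describes one unit of parallel work, here one base file.
class HudiSplit implements ConnectorSplit {
    private final String path;

    HudiSplit(String path) {
        this.path = path;
    }

    @Override
    public String getPath() {
        return path;
    }
}

// (b) HudiSplitManager turns a table into splits that workers scan in parallel.
class HudiSplitManager implements ConnectorSplitManager {
    @Override
    public List<ConnectorSplit> getSplits(String tableName) {
        // A real implementation would list file slices via the Hudi timeline.
        return List.of(
                new HudiSplit("s3://bucket/" + tableName + "/part-0.parquet"),
                new HudiSplit("s3://bucket/" + tableName + "/part-1.parquet"));
    }
}
```

Trino would then call the split manager once per table scan and hand each split to a page source for reading.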
>> > >
>> > > Let me know your thoughts on the proposal. I can draft an RFC for the
>> > > detailed design discussion once we have consensus.
>> > >
>> > > Regards,
>> > > Sagar
>> > >
>> > > References:
>> > > [1] https://github.com/prestodb/presto/commits?author=vinothchandar
>> > > [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
>> > > [3] https://github.com/trinodb/trino/pull/9641
>> > > [4]
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
>> > > [5]
>> > >
>> >
>> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
>> > > [6] https://github.com/codope/trino/tree/hudi-plugin
>> > > [7] https://trino.io/docs/current/develop/connectors.html
>> > >
>> >
>>
>>
>> --
>> *Jian Feng,冯健*
>> Shopee | Engineer | Data Infrastructure
>>
>

Re: [DISCUSS] Trino Plugin for Hudi

Posted by Vinoth Chandar <vi...@apache.org>.
Could we please kick off an RFC for this?

On Thu, Nov 4, 2021 at 8:58 PM sagar sumit <sa...@gmail.com> wrote:

> I have created an umbrella JIRA to track this story:
> https://issues.apache.org/jira/browse/HUDI-2687
> Please also join #trino-hudi-connector channel in Hudi Slack for more
> discussion.
>
> Regards,
> Sagar
>
> On Thu, Oct 21, 2021 at 5:38 PM sagar sumit <sa...@gmail.com>
> wrote:
>
> > This patch supports snapshot queries on MOR table:
> > https://github.com/trinodb/trino/pull/9641
> > That works with the existing hive connector.
> >
> > Right now, I have only prototyped snapshot queries on COW table with the
> > new hudi connector in https://github.com/codope/trino/tree/hudi-plugin
> > I will be working on supporting the MOR table as well.
> >
> > Regards,
> > Sagar
> >
> > On Wed, Oct 20, 2021 at 4:48 PM Jian Feng <ji...@shopee.com> wrote:
> >
> >> When can Trino support snapshot queries on the Merge-on-read table?
> >>
> >> On Mon, Oct 18, 2021 at 9:06 PM 周康 <zh...@gmail.com> wrote:
> >>
> >> > +1 i have send a message on trino slack, really appreciate for the new
> >> > trino plugin/connector.
> >> > https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200
> >> >
> >> > looking forward to the RFC and more discussion
> >> >
> >> > On 2021/10/17 06:06:09 sagar sumit wrote:
> >> > > Dear Hudi Community,
> >> > >
> >> > > I would like to propose the development of a new Trino
> >> plugin/connector
> >> > for
> >> > > Hudi.
> >> > >
> >> > > Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables
> >> and
> >> > > read-optimized queries on Merge-On-Read tables with Trino, through
> the
> >> > > input format based integration in the Hive connector [1
> >> > > <https://github.com/prestodb/presto/commits?author=vinothchandar>].
> >> This
> >> > > approach has known performance limitations with very large tables,
> >> which
> >> > > has been since fixed on PrestoDB [2
> >> > > <https://prestodb.io/blog/2020/08/04/prestodb-and-hudi>]. We are
> >> > working on
> >> > > replicating the same fixes on Trino as well [3
> >> > > <https://github.com/trinodb/trino/pull/9641>].
> >> > >
> >> > > However, as Hudi keeps getting better, a new plugin to provide
> access
> >> to
> >> > > Hudi data and metadata will help in unlocking those capabilities for
> >> the
> >> > > Trino users. Just to name a few benefits, metadata-based listing,
> full
> >> > > schema evolution, etc [4
> >> > > <
> >> >
> >>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> >> > >].
> >> > > Moreover, a separate Hudi connector would allow its independent
> >> evolution
> >> > > without having to worry about hacking/breaking the Hive connector.
> >> > >
> >> > > A separate connector also falls in line with our vision [5
> >> > > <
> >> >
> >>
> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> >> > >]
> >> > > when we think of a standalone timeline server or a lake cache to
> >> balance
> >> > > the tradeoff between writing and querying. Imagine users having read
> >> and
> >> > > write access to data and metadata in Hudi directly through Trino.
> >> > >
> >> > > I did some prototyping to get the snapshot queries on a Hudi COW
> table
> >> > > working with a new plugin [6
> >> > > <https://github.com/codope/trino/tree/hudi-plugin>], and I feel the
> >> > effort
> >> > > is worth it. High-level approach is to implement the connector SPI
> [7
> >> > > <https://trino.io/docs/current/develop/connectors.html>] provided
> by
> >> > Trino
> >> > > such as:
> >> > > a) HudiMetadata implements ConnectorMetadata to fetch table
> metadata.
> >> > > b) HudiSplit and HudiSplitManager implement ConnectorSplit and
> >> > > ConnectorSplitManager to produce logical units of data partitioning,
> >> so
> >> > > that Trino can parallelize reads and writes.
> >> > >
> >> > > Let me know your thoughts on the proposal. I can draft an RFC for
> the
> >> > > detailed design discussion once we have consensus.
> >> > >
> >> > > Regards,
> >> > > Sagar
> >> > >
> >> > > References:
> >> > > [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> >> > > [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> >> > > [3] https://github.com/trinodb/trino/pull/9641
> >> > > [4]
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> >> > > [5]
> >> > >
> >> >
> >>
> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> >> > > [6] https://github.com/codope/trino/tree/hudi-plugin
> >> > > [7] https://trino.io/docs/current/develop/connectors.html
> >> > >
> >> >
> >>
> >>
> >> --
> >> *Jian Feng,冯健*
> >> Shopee | Engineer | Data Infrastructure
> >>
> >
>