Posted to dev@hudi.apache.org by Satish Kotha <sa...@uber.com.INVALID> on 2020/05/29 21:37:12 UTC

[DISCUSS] querying commit metadata from spark DataSource

Hello folks,

We have a use case to incrementally generate data for a hudi table (say
'table2') by transforming data from another hudi table (say, 'table1'). We
want to atomically store the commit timestamps read from table1 into
table2's commit metadata.

This is similar to how DeltaStreamer operates with kafka offsets. However,
DeltaStreamer is java code and can easily query the kafka offsets it has
processed by creating a metaclient for the target table. We want to use
pyspark, and I don't see a good way to query the commit metadata of table1
from a DataSource.

I'm considering making one of the below changes to hoodie to make this
easier.

Option 1: Add a new relation in hudi-spark to query commit metadata. This
relation would present a 'metadata view' to query and filter metadata.
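
For illustration, a rough pyspark sketch of what Option 1 could look like,
borrowing the 'hudi-timeline' format name suggested later in this thread.
The format name and the output schema are both hypothetical at this point:

    # Hypothetical sketch: no "hudi-timeline" source exists today, and the
    # column names are made up for illustration.
    timeline = spark.read.format("hudi-timeline").load("/path/to/table1")
    (timeline
        .filter("action = 'commit'")
        .select("instantTime", "action", "extraMetadata")
        .show(truncate=False))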

Option 2: Add more DataSource options on top of incremental querying to
allow fetching the last consumed position from the target table. For
example, users could specify 'hoodie.consume.metadata.table: table2BasePath'
and issue an incremental query on table1. IncrementalRelation would then
read table2's metadata first to identify 'consume.start.timestamp' and
start the incremental read on table1 from that timestamp.
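
Again purely for illustration, a pyspark sketch of how Option 2 might be
invoked; 'hoodie.consume.metadata.table' is the hypothetical option
proposed above, and the incremental query option keys are approximate:

    # Hypothetical sketch: 'hoodie.consume.metadata.table' does not exist;
    # it is the proposal being discussed here.
    df = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.consume.metadata.table", "/path/to/table2")
        .load("/path/to/table1"))
    # Hudi would resolve the begin instant from table2's last commit
    # metadata instead of the user passing an explicit
    # 'hoodie.datasource.read.begin.instanttime'.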

Option 2 looks simpler to implement. But it seems a bit hacky, because we
are reading metadata from table2 when the data source is table1.

Option 1 is a bit more complex. But it is cleaner and not tightly coupled
to incremental reads; for example, use cases other than incremental reads
can leverage the same relation to query metadata.

What do you guys think? Let me know if there are other simpler solutions.
Appreciate any feedback.

Thanks
Satish

Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Shiyan Xu <xu...@gmail.com>.
Yes, tickets linked.


Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks Raymond!

yes.. we can make this a config and leave it to the user to decide if they
want to use a global table for all their hudi tables (or) keep
one error table for each hudi table..

For this effort, does it make sense to take a dependency on the
multi-writer jira HUDI-944, which liwei filed?


Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Shiyan Xu <xu...@gmail.com>.
Yes, Vinoth, it does go a bit too far with first-class support for these
data. A global error table can do the job easily. As we discussed
yesterday, parallel local error tables with a `_errors` suffix could also
benefit some scenarios, like different product teams managing their own
tables, or a 2B case where customers manage their own data; those would
prefer good segregation of errors and other related data. Let me note down
the points in RFC-20 for further discussion. Thanks for the feedback!
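
For instance (a sketch with made-up paths, just to illustrate the naming):

    # Naming purely illustrative: derive a parallel error table's base path.
    def error_table_path(base_path):
        return base_path.rstrip("/") + "_errors"

    error_table_path("/data/orders")  # -> "/data/orders_errors"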


Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Raymond,

I am not sure generalizing this to all metadata - like errors and metrics -
would be a good idea. We can certainly implement logging errors to a common
errors hudi table, with a certain schema. But these can be just regular
“hudi” format tables.
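
To make that concrete, a minimal pyspark sketch of logging errors to such
a common table. The error schema, names, and paths are made up for
illustration; the write options are the standard hudi datasource ones:

    # Sketch only: the error schema and all names here are assumptions.
    errors = spark.createDataFrame(
        [("table1", "20200601120000", "uuid-1", "schema mismatch")],
        ["source_table", "instant_time", "record_key", "error_msg"])
    (errors.write.format("hudi")
        .option("hoodie.table.name", "hudi_errors")
        .option("hoodie.datasource.write.recordkey.field", "record_key")
        .option("hoodie.datasource.write.precombine.field", "instant_time")
        .option("hoodie.datasource.write.partitionpath.field", "source_table")
        .mode("append")
        .save("/path/to/hudi_errors"))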

Unlike the timeline metadata, these are really external data, not related
to a given table’s core functioning.. we don’t necessarily want to keep one
error table per hudi table..

Thoughts?


Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Shiyan Xu <xu...@gmail.com>.
I also encountered use cases where I'd like to programmatically query
metadata.
+1 on the idea of format(“hudi-timeline”)

I also feel that the metadata can be extended further to include more info
like errors, metrics/write statistics, etc. Like the newly proposed error
handling, we could also store all metrics or write stats there too, and
relate them to the timeline actions.

A potential use case could be: with all this info encapsulated within
metadata, we may be able to derive some insightful results (by checking
against some benchmarks) and answer questions like: does table A need more
tuning? does table B exceed its error budget?

Programmatic query of this metadata can help manage many tables in
diagnosis and inspection. We may need different read formats like
format("hudi-errors") or format("hudi-metrics").

Sorry, this sidetracked from the original question.. these are really rough
high-level thoughts, and may show signs of over-engineering. Would like to
hear some feedback. Thanks.





Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Satish Kotha <sa...@uber.com.INVALID>.
Got it. I'll look into implementation choices for creating a new data
source. Appreciate all the feedback.


Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Vinoth Chandar <vi...@apache.org>.
>Is it to separate data and metadata access?
Correct. We already have modes for querying data using format("hudi"). I
feel it will get very confusing to mix data and metadata in the same
source.. e.g. a lot of options we support for data may not even make
sense for the TimelineRelation.

>This class seems like a list of static methods, I'm not seeing where
>these are accessed from
That's the public API for obtaining this information from Scala/Java Spark.
If you have a way of calling this from python through some bridge, without
painful bridges (e.g. jython), that might be a tactical solution that can
meet your needs.
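
For instance, a rough sketch over the py4j gateway that pyspark already
ships with, assuming the hudi-spark bundle is on the driver classpath (the
class is spelled HoodieDataSourceHelpers there; the method names below are
its static helpers):

    # py4j bridge sketch; assumes hudi-spark bundle on the driver classpath.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        spark._jsc.hadoopConfiguration())
    helpers = jvm.org.apache.hudi.HoodieDataSourceHelpers
    latest = helpers.latestCommit(fs, "/path/to/table1")
    commits = helpers.listCommitsSince(fs, "/path/to/table1", "000")
    print(latest, list(commits))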


Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Satish Kotha <sa...@uber.com.INVALID>.
Thanks for the feedback.

What is the advantage of doing
spark.read.format(“hudi-timeline”).load(basepath) as opposed to doing a new
relation? Is it to separate data and metadata access?

> Are you looking for similar functionality as HoodieDatasourceHelpers?

This class seems like a list of static methods; I'm not seeing where these
are accessed from. But I need a way to query metadata details easily
in pyspark.


On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar <vi...@apache.org> wrote:

> Also please take a look at https://issues.apache.org/jira/browse/HUDI-309.
>
> This was an effort to make the timeline more generalized for querying (for
> a different purpose).. but good to revisit now..
>
> On Sun, May 31, 2020 at 11:04 PM vbalaji@apache.org <vb...@apache.org>
> wrote:
>
> >
> > I strongly recommend using a separate datasource relation (option 1) to
> > query the timeline. It is elegant and fits well with spark APIs.
> > Thanks,
> > Balaji.V
> >
> > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> > <vi...@apache.org> wrote:
> >
> > Hi Satish,
> >
> > Are you looking for similar functionality as HoodieDatasourceHelpers?
> >
> > We have historically relied on the cli to inspect the table, which does
> > not lend itself well to programmatic access.. Overall I like option 1 -
> > allowing the timeline to be queryable with a standard schema does seem
> > way nicer.
> >
> > I am wondering, though, if we should introduce a new view. Instead, we
> > can use a different data source name -
> > spark.read.format(“hudi-timeline”).load(basepath). We can start by
> > allowing querying of the active timeline and expand this to the archived
> > timeline later?
> >
> > What do others think?
> >
> >
> >
> >
> > On Fri, May 29, 2020 at 2:37 PM Satish Kotha
> <satishkotha@uber.com.invalid
> > >
> > wrote:
> >
> > > Hello folks,
> > >
> > > We have a use case to incrementally generate data for a hudi table (say
> > > 'table2') by transforming data from another hudi table (say, table1). We
> > > want to atomically store commit timestamps read from table1 into table2
> > > commit metadata.
> > >
> > > This is similar to how DeltaStreamer operates with kafka offsets.
> > > However, DeltaStreamer is java code and can easily query the kafka
> > > offsets processed by creating a metaclient for the target table. We want
> > > to use pyspark, and I don't see a good way to query commit metadata of
> > > table1 from the DataSource.
> > >
> > > I'm considering making one of the below changes to hoodie to make this
> > > easier.
> > >
> > > Option 1: Add a new relation in hudi-spark to query commit metadata. This
> > > relation would present a 'metadata view' to query and filter metadata.
> > >
> > > Option 2: Add other DataSource options on top of incremental querying to
> > > allow fetching from the source table. For example, users can specify
> > > 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> > > query on table1. Then, IncrementalRelation would read table2 metadata
> > > first to identify 'consume.start.timestamp' and start an incremental read
> > > on table1 with that timestamp.
> > >
> > > Option 2 looks simpler to implement, but it seems a bit hacky because we
> > > are reading metadata from table2 when the data source is table1.
> > >
> > > Option 1 is a bit more complex, but it is cleaner and not tightly coupled
> > > to incremental reads. For example, use cases other than incremental reads
> > > can leverage the same relation to query metadata.
> > >
> > > What do you guys think? Let me know if there are other simpler
> solutions.
> > > Appreciate any feedback.
> > >
> > > Thanks
> > > Satish
> > >
>

Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Vinoth Chandar <vi...@apache.org>.
Also please take a look at https://issues.apache.org/jira/browse/HUDI-309.

This was an effort to make the timeline more generalized for querying (for
a different purpose).. but good to revisit now..

On Sun, May 31, 2020 at 11:04 PM vbalaji@apache.org <vb...@apache.org>
wrote:

>
> I strongly recommend using a separate datasource relation (option 1) to
> query the timeline. It is elegant and fits well with spark APIs.
> Thanks,
> Balaji.V
>
> On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> <vi...@apache.org> wrote:
>
> Hi Satish,
>
> Are you looking for similar functionality as HoodieDatasourceHelpers?
>
> We have historically relied on the cli to inspect the table, which does not
> lend itself well to programmatic access.. Overall I like option 1 -
> allowing the timeline to be queryable with a standard schema does seem way
> nicer.
>
> I am wondering, though, if we should introduce a new view. Instead, we can
> use a different data source name -
> spark.read.format(“hudi-timeline”).load(basepath). We can start by allowing
> querying of the active timeline and expand this to the archived timeline
> later?
>
> What do others think?
>
>
>
>
> On Fri, May 29, 2020 at 2:37 PM Satish Kotha <satishkotha@uber.com.invalid
> >
> wrote:
>
> > Hello folks,
> >
> > We have a use case to incrementally generate data for a hudi table (say
> > 'table2') by transforming data from another hudi table (say, table1). We
> > want to atomically store commit timestamps read from table1 into table2
> > commit metadata.
> >
> > This is similar to how DeltaStreamer operates with kafka offsets. However,
> > DeltaStreamer is java code and can easily query the kafka offsets
> > processed by creating a metaclient for the target table. We want to use
> > pyspark, and I don't see a good way to query commit metadata of table1
> > from the DataSource.
> >
> > I'm considering making one of the below changes to hoodie to make this
> > easier.
> >
> > Option 1: Add a new relation in hudi-spark to query commit metadata. This
> > relation would present a 'metadata view' to query and filter metadata.
> >
> > Option 2: Add other DataSource options on top of incremental querying to
> > allow fetching from the source table. For example, users can specify
> > 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> > query on table1. Then, IncrementalRelation would read table2 metadata
> > first to identify 'consume.start.timestamp' and start an incremental read
> > on table1 with that timestamp.
> >
> > Option 2 looks simpler to implement, but it seems a bit hacky because we
> > are reading metadata from table2 when the data source is table1.
> >
> > Option 1 is a bit more complex, but it is cleaner and not tightly coupled
> > to incremental reads. For example, use cases other than incremental reads
> > can leverage the same relation to query metadata.
> >
> > What do you guys think? Let me know if there are other simpler solutions.
> > Appreciate any feedback.
> >
> > Thanks
> > Satish
> >

Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by "vbalaji@apache.org" <vb...@apache.org>.
 
I strongly recommend using a separate datasource relation (option 1) to query
the timeline. It is elegant and fits well with spark APIs.
Thanks,
Balaji.V

On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
<vi...@apache.org> wrote:

Hi Satish,

Are you looking for similar functionality as HoodieDatasourceHelpers?

We have historically relied on the cli to inspect the table, which does not
lend itself well to programmatic access.. Overall I like option 1 -
allowing the timeline to be queryable with a standard schema does seem way
nicer.

I am wondering, though, if we should introduce a new view. Instead, we can
use a different data source name -
spark.read.format(“hudi-timeline”).load(basepath). We can start by allowing
querying of the active timeline and expand this to the archived timeline
later?

What do others think?




On Fri, May 29, 2020 at 2:37 PM Satish Kotha <sa...@uber.com.invalid>
wrote:

> Hello folks,
>
> We have a use case to incrementally generate data for hudi table (say
> 'table2')  by transforming data from other hudi table(say, table1). We want
> to atomically store commit timestamps read from table1 into table2 commit
> metadata.
>
> This is similar to how DeltaStreamer operates with kafka offsets. However,
> DeltaStreamer is java code and can easily query kafka offset processed by
> creating metaclient for target table. We want to use pyspark and I don't
> see a good way to query commit metadata of table1 from DataSource.
>
> I'm considering making one of the below changes to hoodie to make this
> easier.
>
> Option1: Add new relation in hudi-spark to query commit metadata. This
> relation would present a 'metadata view' to query and filter metadata.
>
> Option2: Add other DataSource options on top of incremental querying to
> allow fetching from source table. For example, users can specify
> 'hoodie.consume.metadata.table: table2BasePath'  and issue incremental
> query on table1. Then, IncrementalRelation would go read table2 metadata
> first to identify 'consume.start.timestamp' and start incremental read on
> table1 with that timestamp.
>
> Option 2 looks simpler to implement. But, seems a bit hacky because we are
> reading metadata from table2 when data souce is table1.
>
> Option1 is a bit more complex. But, it is cleaner and not tightly coupled
> to incremental reads. For example, use cases other than incremental reads
> can leverage same relation to query metadata
>
> What do you guys think? Let me know if there are other simpler solutions.
> Appreciate any feedback.
>
> Thanks
> Satish
>  

Re: [DISCUSS] querying commit metadata from spark DataSource

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Satish,

Are you looking for similar functionality as HoodieDatasourceHelpers?

We have historically relied on the cli to inspect the table, which does not
lend itself well to programmatic access.. Overall I like option 1 -
allowing the timeline to be queryable with a standard schema does seem way
nicer.

I am wondering, though, if we should introduce a new view. Instead, we can
use a different data source name -
spark.read.format(“hudi-timeline”).load(basepath). We can start by allowing
querying of the active timeline and expand this to the archived timeline
later?

What do others think?
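
As an aside, here is a rough pyspark sketch of the workaround available
today: reaching the Java-side helpers through the py4j gateway. The class
and method names below follow hudi's HoodieDataSourceHelpers and may differ
across versions, so treat this as an assumption rather than a confirmed API:

    # Hypothetical workaround: call hudi's Java helpers through spark's JVM
    # gateway to fetch the latest completed commit on a table.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        spark._jsc.hadoopConfiguration())
    latest_commit = jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(
        fs, base_path)

This is clunky compared to a first-class datasource, which is part of why
option 1 feels nicer.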




On Fri, May 29, 2020 at 2:37 PM Satish Kotha <sa...@uber.com.invalid>
wrote:

> Hello folks,
>
> We have a use case to incrementally generate data for a hudi table (say
> 'table2') by transforming data from another hudi table (say, table1). We
> want to atomically store commit timestamps read from table1 into table2
> commit metadata.
>
> This is similar to how DeltaStreamer operates with kafka offsets. However,
> DeltaStreamer is java code and can easily query the kafka offsets processed
> by creating a metaclient for the target table. We want to use pyspark, and I
> don't see a good way to query commit metadata of table1 from the DataSource.
>
> I'm considering making one of the below changes to hoodie to make this
> easier.
>
> Option 1: Add a new relation in hudi-spark to query commit metadata. This
> relation would present a 'metadata view' to query and filter metadata.
>
> Option 2: Add other DataSource options on top of incremental querying to
> allow fetching from the source table. For example, users can specify
> 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> query on table1. Then, IncrementalRelation would read table2 metadata first
> to identify 'consume.start.timestamp' and start an incremental read on
> table1 with that timestamp.
>
> Option 2 looks simpler to implement, but it seems a bit hacky because we are
> reading metadata from table2 when the data source is table1.
>
> Option 1 is a bit more complex, but it is cleaner and not tightly coupled
> to incremental reads. For example, use cases other than incremental reads
> can leverage the same relation to query metadata.
>
> What do you guys think? Let me know if there are other simpler solutions.
> Appreciate any feedback.
>
> Thanks
> Satish
>
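
For illustration, Option 2 as quoted above might look roughly like this from
pyspark. Note that 'hoodie.consume.metadata.table' is the hypothetical option
proposed in this thread (not an existing hudi config), and the exact
incremental-query option keys vary by hudi version:

    # Hypothetical: incremental read of table1, where IncrementalRelation
    # would derive the start instant from table2's latest commit metadata
    # instead of an explicit begin-instant option.
    df = (spark.read.format("org.apache.hudi")
          .option("hoodie.datasource.query.type", "incremental")
          .option("hoodie.consume.metadata.table", table2_base_path)
          .load(table1_base_path))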