Posted to dev@hudi.apache.org by Igor Basko <ig...@gmail.com> on 2020/02/05 07:32:44 UTC

Datasource Writer Schema Evolution

Hi All,
I've tried to write data with some schema changes using the Datasource
Writer.
The procedure was:
First I wrote an event with a specific schema.
After that I wrote a different event with the same schema plus one additional
field.

When I read from the Hudi table, I get both events, but with the original
schema.
I was expecting to get both events with the newer schema, with some default
value in the new field for the first event.

I've created a gist that describes my experience:
https://gist.github.com/igorbasko01/4a1d0cf7c06a5b216382260efaa1f333
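
Roughly, the writes looked like the sketch below (just an illustration: the
table name, record key and most field names are made up here, apart from the
added "direction" column and the 20200205/20200206 partitions; the gist above
has the actual code):

import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import spark.implicits._   // in spark-shell; needed for .toDF below

val basePath = "/tmp/hudi/drivers"

// First write: one event with the original schema, into partition 20200205.
val batch1 = Seq(("d1", "2020-02-05 10:00:00", 45.0, "20200205"))
  .toDF("driver_id", "ts", "speed", "date")
batch1.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "drivers")
  .option(RECORDKEY_FIELD_OPT_KEY, "driver_id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "date")
  .mode(SaveMode.Overwrite)
  .save(basePath)

// Second write: same schema plus one added field ("direction"), into 20200206.
val batch2 = Seq(("d2", "2020-02-06 10:00:00", 50.0, "N", "20200206"))
  .toDF("driver_id", "ts", "speed", "direction", "date")
batch2.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "drivers")
  .option(RECORDKEY_FIELD_OPT_KEY, "driver_id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "date")
  .mode(SaveMode.Append)
  .save(basePath)

// Reading the whole table back returns both events, but only the original columns:
spark.read.format("org.apache.hudi").load(basePath + "/*").show()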

I would like to know whether schema evolution is supported when using the
Datasource Writer, or whether I'm doing something wrong.

Thanks a lot.

Re: Datasource Writer Schema Evolution

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

When reading through the datasource API like you are, the schema merging
etc. behaves the same as spark.read.parquet(). Hudi merely filters the files
on storage down to the latest snapshot:

https://hudi.apache.org/docs/querying_data.html#read-optimized-query-1
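
To illustrate the point (a sketch only; the paths and globs are hypothetical,
reusing the example table from this thread): the schema you end up with comes
from Spark's parquet schema inference over the files Hudi selects, the same as
if you read those parquet files directly.

// Hudi datasource read: schema is inferred by Spark's ParquetFileFormat
// over the latest-snapshot files only.
val viaHudi = spark.read.format("org.apache.hudi").load("/tmp/hudi/drivers/*")
viaHudi.printSchema()

// Reading the underlying parquet files directly follows the same inference
// rules (no union of schemas unless mergeSchema is set).
val viaParquet = spark.read.parquet("/tmp/hudi/drivers/*/*.parquet")
viaParquet.printSchema()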

thanks
Vinoth


Re: Datasource Writer Schema Evolution

Posted by leesf <le...@gmail.com>.
If you update the partition (20200205) after adding the fields, the added
fields will show up with a plain read, e.g. `val hudiDF2 =
spark.read.format("org.apache.hudi").load("/tmp/hudi/drivers/*");
hudiDF2.show`, which doesn't need mergeSchema across all the files.
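
As a sketch (using the same illustrative field names as the write example
earlier in this thread), that update could be an upsert of a record carrying
the new "direction" field back into 20200205:

import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import spark.implicits._

// Upsert a record that has the new "direction" field into partition 20200205,
// so the latest file in that partition carries the evolved schema.
val fix = Seq(("d1", "2020-02-05 11:00:00", 45.0, "N", "20200205"))
  .toDF("driver_id", "ts", "speed", "direction", "date")
fix.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "drivers")
  .option(RECORDKEY_FIELD_OPT_KEY, "driver_id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "date")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/drivers")

// The plain read (no mergeSchema) now shows the added column:
val hudiDF2 = spark.read.format("org.apache.hudi").load("/tmp/hudi/drivers/*")
hudiDF2.show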


Re: Datasource Writer Schema Evolution

Posted by Igor Basko <ig...@gmail.com>.
Thanks a lot for the answer.
I was sure Hudi would store the latest schema, instead of merging it from
all the files.


Re: Datasource Writer Schema Evolution

Posted by leesf <le...@gmail.com>.
Hi Igor,

It is because Spark's ParquetFileFormat infers the schema from the parquet
file under the 20200205 dir, and that file does not contain the added
column (direction). You could try `val hudiDF2 =
spark.read.format("org.apache.hudi").option("mergeSchema",
"true").load("/tmp/hudi/drivers/*")` to get the schema merged from 20200205
and 20200206, and it then shows the added column. I do not know whether it is
a common solution, but it solves the problem.
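
Written out as a standalone snippet (same path as above), that read would be:

val hudiDF2 = spark.read.format("org.apache.hudi")
  .option("mergeSchema", "true")
  .load("/tmp/hudi/drivers/*")
hudiDF2.printSchema()   // now includes the "direction" column merged in from 20200206
hudiDF2.show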

Best,
Leesf
