Posted to dev@iceberg.apache.org by Peter Vary <pv...@cloudera.com.INVALID> on 2020/11/24 14:50:50 UTC

Iceberg - Hive schema synchronization

Hi Team,

With Shardul we had a longer discussion yesterday about the schema synchronization between Iceberg and Hive, and we thought that it would be good to ask the opinion of the greater community too.

We can have 2 sources for the schemas:
   1. Hive table definition / schema
   2. Iceberg schema

If we want Iceberg and Hive to work together we have to find a way to synchronize them. Either by defining a master schema, or by defining a compatibility matrix and conversion for them.
In previous Hive integrations we can see examples for both:
   - With Avro there is a possibility to read the schema from the data file directly, and the master schema is the one which is in Avro.
   - With HBase you can provide a mapping between Hive and HBase columns by providing the hbase.columns.mapping table property.

Maybe the differences are caused by how the storage format is perceived: Avro being a simple storage format, HBase being an independent query engine - but this is just a debatable opinion :)

I would like us to decide how Iceberg - Hive integration should be handled.

There are at least 2 questions:
   1. How flexible should we be with the type mapping between Hive and Iceberg types?
      1. Shall we have a strict mapping - this way, if we have an Iceberg schema, we can immediately derive the Hive schema from it.
      2. Shall we be more relaxed on this - automatic casting / conversions can be built into the integration, allowing users to skip view and/or UDF creation for typical conversions.
   2. How flexible should we be with column mappings?
      1. Shall we have a strict 1-on-1 mapping - this way, if we have an Iceberg schema, we can immediately derive the Hive schema from it. We still have to omit Iceberg columns which do not have a representation available in Hive.
      2. Shall we allow flexibility at Hive table creation to choose specific Iceberg columns, instead of immediately creating a Hive table with all of the columns from the Iceberg table.

Currently I would choose:
   - Strict type mapping, because of the following reasons:
      - Faster execution (we want as few checks and conversions as possible, since they will be executed for every record)
      - Complexity increases exponentially with every conversion
   - Flexible column mapping (see the sketch after this list):
      - I think it will be a typical situation that we have a huge Iceberg table storing the facts with a large number of columns, and we would like to create multiple Hive tables above it. The problem could be solved by creating the table and adding a view above that table, but I think it would be more user-friendly if we could avoid this extra step.
      - The added complexity is at table creation / query planning, which has a far smaller impact on the overall performance.
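
To make the flexible column mapping option more concrete, here is a minimal Java sketch of how it could work, assuming the Hive table definition carries a list of exposed columns. The table property name iceberg.hive.exposed.columns is made up for illustration; Schema.select() is existing Iceberg API used for projection.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.iceberg.Schema;

    // Sketch only: expose a subset of the Iceberg columns as the Hive schema.
    // The property name mentioned below is hypothetical; Schema.select() is real Iceberg API.
    public class FlexibleColumnMapping {

      // e.g. TBLPROPERTIES ('iceberg.hive.exposed.columns'='id,name,event_ts')
      static Schema exposedSchema(Schema icebergSchema, String exposedColumns) {
        if (exposedColumns == null || exposedColumns.isEmpty()) {
          return icebergSchema;                               // default: expose every column
        }
        List<String> names = Arrays.asList(exposedColumns.split("\\s*,\\s*"));
        return icebergSchema.select(names);                   // prune to the requested subset
      }
    }

The pruning would happen once at table load / planning time, which is why the extra complexity should not show up on the per-record path.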

I would love to hear your thoughts as well, since the choice should really depend on the user base and the expected use-cases.

Thanks,
Peter


Appendix 1 - Type mapping proposal:
Iceberg type	Hive2 type	Hive3 type	Status
boolean	BOOLEAN	BOOLEAN	OK
int	INTEGER	INTEGER	OK
long	BIGINT	BIGINT	OK
float	FLOAT	FLOAT	OK
double	DOUBLE	DOUBLE	OK
decimal(P,S)	DECIMAL(P,S)	DECIMAL(P,S)	OK
binary	BINARY	BINARY	OK
date	DATE	DATE	OK
timestamp	TIMESTAMP	TIMESTAMP	OK
timestamptz	TIMESTAMP	TIMESTAMP WITH LOCAL TIME ZONE	TODO
string	STRING	STRING	OK
uuid	STRING or BINARY	STRING or BINARY	TODO
time	-	-	-
fixed(L)	-	-	-
-	TINYINT	TINYINT	-
-	SMALLINT	SMALLINT	-
-	INTERVAL	INTERVAL	-
-	VARCHAR	VARCHAR	-
-	CHAR	CHAR	-
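
For illustration, a minimal Java sketch of the strict mapping in the table above, driven from the Iceberg schema. This is not a proposed implementation; Type.typeId() and Types.DecimalType are existing Iceberg APIs, and the returned Hive type names simply follow the table (unsupported types are rejected explicitly).

    import org.apache.iceberg.types.Type;
    import org.apache.iceberg.types.Types;

    // Sketch of the strict mapping: every supported Iceberg primitive maps to
    // exactly one Hive type name, everything else is rejected.
    public class StrictTypeMapping {

      static String toHiveTypeName(Type type) {
        switch (type.typeId()) {
          case BOOLEAN:   return "boolean";
          case INTEGER:   return "int";
          case LONG:      return "bigint";
          case FLOAT:     return "float";
          case DOUBLE:    return "double";
          case DATE:      return "date";
          case TIMESTAMP: return "timestamp";   // timestamptz handling differs between Hive 2 and Hive 3 (see TODO above)
          case STRING:    return "string";
          case BINARY:    return "binary";
          case DECIMAL: {
            Types.DecimalType decimal = (Types.DecimalType) type;
            return String.format("decimal(%d,%d)", decimal.precision(), decimal.scale());
          }
          default:
            // uuid, time, fixed(L) and the complex types are left out of this sketch
            throw new UnsupportedOperationException("No strict Hive mapping for Iceberg type: " + type);
        }
      }
    }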


Re: Iceberg - Hive schema synchronization

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
A few replies inline.

On Thu, Nov 26, 2020 at 3:49 AM Peter Vary <pv...@cloudera.com> wrote:

> I think the column mapping should also be 1-to-1. Hive would have trouble
> writing to a table if it didn't include all required columns. I think that
> the right thing is for all engines to provide uniform access to all columns.
>
>
> I got your point. I think that Hive would be able to write to the table
> without problem if the given columns are optional (but that is only
> focusing on a small part of the problem)
>
> Would this mean that we have to create a mapping for every Iceberg type
> currently not supported by Hive (uuid/time/fixed), and implement something
> like a conversion to and from a string or binary form of the data, so every
> Iceberg column is accessible?
> Related question: How often do we introduce new types to Iceberg?
>

In Spark, we convert UUID to and from String, and we validate the length of
binary data written to a fixed column. I think that time currently throws
an exception (probably not correct). I think it is either reasonable to
ignore a column or to do some minimal work to use a compatible
representation like for UUID and fixed.
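
A rough, self-contained illustration of those two conversions (not the actual Spark integration code):

    import java.nio.ByteBuffer;
    import java.util.UUID;

    // uuid <-> string conversion and length validation for fixed(L), as described above.
    public class UuidAndFixedHelpers {

      static String uuidToString(UUID uuid) {
        return uuid.toString();                  // 36-character canonical form
      }

      static UUID uuidFromString(String value) {
        return UUID.fromString(value);           // throws IllegalArgumentException on malformed input
      }

      static ByteBuffer validateFixed(ByteBuffer value, int length) {
        if (value.remaining() != length) {
          throw new IllegalArgumentException(
              "fixed(" + length + ") expects exactly " + length + " bytes, got " + value.remaining());
        }
        return value;
      }
    }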

Introducing a new type would require an update to the spec, so I expect
that to be very infrequent.

> Last, Iceberg must be the source of truth for schema information. Schema
> in the Hive metastore should always come from Iceberg, or else columns
> would be inaccessible and Hive would require extra DDL to access new
> columns. Hive should have the same view of an Iceberg table that any other
> engine would.
>
>
> This will mean that every access to a given table will generate an extra
> fetch from HMS + S3, and then we have to parse the snapshot files.
> We had a similar problem with a partitioned Avro table where the table's
> schema was stored in the Avro file itself. A simple getPartitions call
> floored the system because of the multiple parallel IOs.
>

I suspect that this was a problem with Avro because Hive allows
partition-level customization that Iceberg does not. The current schema for
an Iceberg table applies to all data, and would require only reading the
current metadata file. There would be no operation at partition granularity
to cause trouble.


> Add to this that Hive often creates SerDes multiple times for a query, and
> we have to find a way to cache the schema to speed things up.
>

Yes, we may need to add a caching layer. I think our recent discussion
about how to interact with catalogs would help quite a bit here. We already
have a caching catalog that avoids loading a table multiple times. Using
that would avoid multiple independent reads of the metadata file.
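
A minimal sketch of that idea: CachingCatalog.wrap() and loadTable() are existing Iceberg APIs, while the way the Hive SerDe / StorageHandler would hold on to the wrapped catalog is an assumption.

    import org.apache.iceberg.CachingCatalog;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Wrap the underlying catalog once, so repeated SerDe initializations for the
    // same table are served from the cache instead of re-reading the metadata file.
    public class CachedTableLookup {
      private final Catalog catalog;

      public CachedTableLookup(Catalog underlying) {
        this.catalog = CachingCatalog.wrap(underlying);    // existing Iceberg caching catalog
      }

      public Table load(String database, String table) {
        return catalog.loadTable(TableIdentifier.of(database, table));
      }
    }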

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg - Hive schema synchronization

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Thanks for all the responses.
Added my comments below:

> On Nov 25, 2020, at 23:45, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> I agree that a 1-to-1 type mapping is the right option. Some additional mappings should be supported; I think it should be fine to use VARCHAR in DDL to produce a string column in Iceberg.

If the source of the schema is the Iceberg schema, then we should always convert from the Iceberg schema to the Hive schema. If the Hive schema contains additional information, like the fact that a column was VARCHAR at creation time, then we always have to read and merge the 2 schemas to get the final one. So I would go with the strict mapping in this case too. This would simplify the code and most probably the user experience as well.

> Iceberg is also strict about type promotion, and I don't think that we should confuse type promotion with how data is returned for an engine. Using the example of long to string from Vivekanand, it makes no sense for Hive to read all values as strings when another engine would read them as longs. This would also introduce write errors because Hive may attempt to write strings to the table when it must store longs. The right solution for this is type promotion, where the long column is promoted to string. Iceberg doesn't currently support that, but we can loosen those restrictions if we choose.

Agreed.

> I think the column mapping should also be 1-to-1. Hive would have trouble writing to a table if it didn't include all required columns. I think that the right thing is for all engines to provide uniform access to all columns.

I got your point. I think that Hive would be able to write to the table without problem if the given columns are optional (but that is only focusing on a small part of the problem)

Would this mean that we have to create a mapping for every Iceberg type currently not supported by Hive (uuid/time/fixed), and implement something like a conversion to and from a string or binary form of the data, so every Iceberg column is accessible?
Related question: How often do we introduce new types to Iceberg?

> Last, Iceberg must be the source of truth for schema information. Schema in the Hive metastore should always come from Iceberg, or else columns would be inaccessible and Hive would require extra DDL to access new columns. Hive should have the same view of an Iceberg table that any other engine would.

This will mean that every access to a given table will generate an extra fetch from HMS + S3, and then we have to parse the snapshot files.
We had a similar problem with a partitioned Avro table where the table's schema was stored in the Avro file itself. A simple getPartitions call floored the system because of the multiple parallel IOs.

Add to this that Hive often creates SerDes multiple times for a query, and we have to find a way to cache the schema to speed things up.

Thanks,
Peter



Re: Iceberg - Hive schema synchronization

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree that a 1-to-1 type mapping is the right option. Some additional
mappings should be supported; I think it should be fine to use VARCHAR in
DDL to produce a string column in Iceberg.

Iceberg is also strict about type promotion, and I don't think that we
should confuse type promotion with how data is returned for an engine.
Using the example of long to string from Vivekanand, it makes no sense for
Hive to read all values as strings when another engine would read them as
longs. This would also introduce write errors because Hive may attempt to
write strings to the table when it must store longs. The right solution for
this is type promotion, where the long column is promoted to string.
Iceberg doesn't currently support that, but we can loosen those
restrictions if we choose.

I think the column mapping should also be 1-to-1. Hive would have trouble
writing to a table if it didn't include all required columns. I think that
the right thing is for all engines to provide uniform access to all columns.

Last, Iceberg must be the source of truth for schema information. Schema in
the Hive metastore should always come from Iceberg, or else columns would
be inaccessible and Hive would require extra DDL to access new columns.
Hive should have the same view of an Iceberg table that any other engine
would.

rb

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg - Hive schema synchronization

Posted by Zoltán Borók-Nagy <bo...@apache.org>.
Hi Everyone,

In Impala we face the same challenges. I think a strict 1-to-1 type mapping
would be beneficial because that way we could derive the Iceberg schema
from the Hive schema, not just the other way around. So we could just
naturally create Iceberg tables via DDL.
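
As a hedged sketch of that direction, deriving Iceberg types from Hive type names could look roughly like this (the Types factory methods are existing Iceberg API; the set of accepted names just mirrors the strict mapping table from the first mail):

    import org.apache.iceberg.types.Type;
    import org.apache.iceberg.types.Types;

    // Reverse direction of the strict mapping: Hive type name -> Iceberg type,
    // rejecting anything outside the 1-to-1 table (similar to Impala's current behaviour).
    public class HiveToIcebergTypes {

      static Type fromHiveTypeName(String hiveType) {
        switch (hiveType.toLowerCase()) {
          case "boolean":   return Types.BooleanType.get();
          case "int":       return Types.IntegerType.get();
          case "bigint":    return Types.LongType.get();
          case "float":     return Types.FloatType.get();
          case "double":    return Types.DoubleType.get();
          case "date":      return Types.DateType.get();
          case "timestamp": return Types.TimestampType.withoutZone();
          case "string":    return Types.StringType.get();
          case "binary":    return Types.BinaryType.get();
          default:
            // decimal(p,s) needs parsing; varchar/char/tinyint/... have no strict mapping here
            throw new UnsupportedOperationException("No strict Iceberg mapping for Hive type: " + hiveType);
        }
      }
    }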

We should use the same type mapping for Hive and Impala, which is almost
true already. The type mappings in Impala reflect the table above, where we
have a 1-to-1 mapping. Impala also has a mapping between fixed(L) and CHAR(L),
although the latter is restricted to UTF8 strings AFAIK. I added this
mapping to Impala master but I'm not sure if it's correct. Currently
Iceberg definitely dislikes it as it throws exceptions on the read path
(because of the UTF8 annotation), so I will probably just remove it.

Right now Impala throws an exception when there is no type mapping, for
example on CREATE TABLE with a VARCHAR column, or when creating an external
table from an Iceberg table that has a UUID column.

For column mapping we currently also do a strict mapping. We consider the
Iceberg table schema as the source of truth, and always derive the Hive
schema from it. We cannot create an external table with only a subset of
the columns. And as I mentioned above, we get an exception when facing an
unsupported type.
I think strict column mapping is useful because we might also want to write
the tables. VIEWs are more explicit for reading only a subset of columns in
my opinion.

Cheers,
    Zoltan



Re: Iceberg - Hive schema synchronization

Posted by Vivekanand Vellanki <vi...@dremio.com>.
Some of the conversions we are seeing are:

   - Decimal to Decimal; not just limited to increasing precision as with Iceberg
   - varchar<N> to string
   - numeric type to numeric type (float to Decimal, double to Decimal, Decimal to double, etc)
   - numeric type to string



Re: Iceberg - Hive schema synchronization

Posted by Owen O'Malley <ow...@gmail.com>.
You left the complex types off of your list (struct, map, array,
uniontype). All of them have natural mappings in Iceberg, except for
uniontype. Interval is supported on output, but not as a column type.
Unfortunately, we have some tables with uniontype, so we'll need a solution
for how to deal with it.

I'm generally in favor of a strict mapping in both type and column
mappings. One piece that I think will help a lot is if we add type
annotations to Iceberg so that for example we could mark a struct as
actually being a uniontype. If someone has the use case where they need to
support Hive's char or varchar types it would make sense to define an
attribute for the max length.

Vivekanand, what kind of conversions are you needing? Hive has a *lot* of
conversions. Many of those conversions are more error-prone than useful.
(For example, I seriously doubt anyone found Hive's conversion of
timestamps to booleans useful...)

.. Owen


Re: Iceberg - Hive schema synchronization

Posted by Vivekanand Vellanki <vi...@dremio.com>.
One of the challenges we've had is that Hive is more flexible with schema
evolution compared to Iceberg. Are you guys also looking at this aspect?
