Posted to user@hive.apache.org by Julien Phalip <jp...@gmail.com> on 2022/05/04 23:18:28 UTC

AvroSerde's inferred schema for NOT NULL columns

Hi,

I'm trying to create a table with a NOT NULL column:  CREATE TABLE mytable
(int_required BIGINT NOT NULL, ....)

However, it looks like the schema that AvroSerde generates ignores the "NOT
NULL" part and outputs the following Avro field schema:
{"name":"int_required","type":["null","long"],"default":null}

Looking at AvroSerde's codebase, it seems that it infers the Avro schema
based on the "columns.types" table property:
https://github.com/apache/hive/blob/release-3.1.2-rc0/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerDe.java#L118

However, it looks like the "columns.types" table property contains no
indication of whether a column is "NOT NULL". The property just contains
"bigint" for the column, so AvroSerde assumes that all columns must accept
null values.
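
For illustration, here is a minimal Java sketch of that parsing step (the
class name is made up, and it assumes hive-serde and its dependencies are on
the classpath), showing that the type string carries no nullability
information:

import java.util.List;

import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

public class ColumnTypesInspection {
  public static void main(String[] args) {
    // This is all that "columns.types" holds for a column declared as
    // "int_required BIGINT NOT NULL": just the type name.
    String columnTypeProperty = "bigint";

    // The same kind of parsing the SerDe performs on the property value.
    List<TypeInfo> typeInfos =
        TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);

    // Prints "bigint"; TypeInfo has no notion of NOT NULL, so the SerDe
    // can only assume the column is nullable.
    System.out.println(typeInfos.get(0).getTypeName());
  }
}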

This means that AvroSerde's inferred schema for that field is the union
["null", "long"] instead of just "long".
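
To make the difference concrete, here is a small sketch using the Avro Java
API (assuming org.apache.avro:avro is on the classpath; the class name is just
for illustration) that parses the field schema AvroSerde emits today and
builds the one I would expect for a NOT NULL column:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class RequiredVsNullableField {
  public static void main(String[] args) {
    // The field schema AvroSerde currently infers, wrapped in a record so it
    // can be parsed on its own.
    Schema inferred = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"mytable\",\"fields\":"
            + "[{\"name\":\"int_required\",\"type\":[\"null\",\"long\"],\"default\":null}]}");

    // What I would expect a BIGINT NOT NULL column to map to: a required long.
    Schema expected = SchemaBuilder.record("mytable").fields()
        .requiredLong("int_required")
        .endRecord();

    System.out.println(inferred.getField("int_required").schema()); // ["null","long"]
    System.out.println(expected.getField("int_required").schema()); // "long"
  }
}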

Is that the intended behavior, or am I missing something?

Thanks,

Julien

Re: AvroSerde's inferred schema for NOT NULL columns

Posted by Peter Vary <pv...@cloudera.com>.
Hi Julien,

Your question again reminds me of the things we were facing with Iceberg :)

In Iceberg it is possible to define `required` columns, and we thought it would be a good idea to convert these columns to `NOT NULL` columns in Hive.
We tried to use the HiveMetaHook API to do this, but we later found that the constraints are handled separately from the base table metadata manipulation. So basically at table creation time we do not know anything about the constraints.
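
Just to illustrate, here is a rough sketch of a metastore hook (assuming the Hive 3.x shape of the HiveMetaHook interface; the class name is made up). At preCreateTable time the Table object only carries the column names and types, not the constraints:

import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.Table;

public class ConstraintLoggingHook implements HiveMetaHook {

  @Override
  public void preCreateTable(Table table) throws MetaException {
    // Each column is only (name, type, comment); neither Table nor FieldSchema
    // exposes NOT NULL here, because constraints are handled separately from
    // the base table metadata.
    for (FieldSchema col : table.getSd().getCols()) {
      System.out.println(col.getName() + " : " + col.getType());
    }
  }

  @Override public void rollbackCreateTable(Table table) throws MetaException { }
  @Override public void commitCreateTable(Table table) throws MetaException { }
  @Override public void preDropTable(Table table) throws MetaException { }
  @Override public void rollbackDropTable(Table table) throws MetaException { }
  @Override public void commitDropTable(Table table, boolean deleteData) throws MetaException { }
}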

For me the issue you are mentioning seems similar, in the sense that the constraints are not available in the SerDe.

I think it would be good to check how the constraints are handled in the different APIs, but if the current version does not expose them, this would definitely need a version upgrade.

Anyway, if you figure out something, I would be interested in the results too :)

Thanks,
Peter
