Posted to issues@hive.apache.org by "xsys (Jira)" <ji...@apache.org> on 2022/09/12 14:51:00 UTC

[jira] [Updated] (HIVE-26533) Column data type is lost when an Avro table with a BYTE column is written through spark-sql

     [ https://issues.apache.org/jira/browse/HIVE-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xsys updated HIVE-26533:
------------------------
    Description: 
h3. Describe the bug

We are trying to store a table in the {{Avro}} file format through the {{spark-sql}} interface. The table's schema contains a column with the {{BYTE}} data type. Additionally, the column's name contains uppercase letters.

When we {{INSERT}} some valid values (e.g. {{{}-128{}}}), we see the following warning:
{code:java}
WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.{code}
 
Finally, when we perform a {{DESC}} on the table, we observe that the {{BYTE}} data type has been converted to {{{}int{}}}, and the case sensitivity of the column name has been lost (it is converted to lowercase).
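The lowercasing is consistent with the metastore storing identifiers case-insensitively. As a minimal, self-contained illustration (plain Java; the class and method names here are hypothetical, not Hive metastore code), once names are normalized on write and the case-preserving Spark-side schema is discarded, the original spelling cannot be recovered:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model (not actual Hive code): a catalog that normalizes
// column names to lowercase on write. Once the case-preserving copy of the
// schema is dropped, only the lowercased spelling survives.
public class CaseInsensitiveCatalog {
    final Map<String, String> columns = new LinkedHashMap<>();

    void addColumn(String name, String type) {
        columns.put(name.toLowerCase(Locale.ROOT), type);
    }

    public static void main(String[] args) {
        CaseInsensitiveCatalog catalog = new CaseInsensitiveCatalog();
        catalog.addColumn("c0", "int");
        catalog.addColumn("C1", "int"); // stored under "c1"
        System.out.println(catalog.columns); // prints {c0=int, c1=int}
    }
}
```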
h3. Steps to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), launch {{spark-sql}} with the Avro package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
 
Execute the following:
{code:java}
spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.359 seconds
spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
Time taken: 1.605 seconds
spark-sql> desc hive_tinyint_avro;
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
c0                      int
c1                      int // Data type and case-sensitivity lost
Time taken: 0.068 seconds, Fetched 2 row(s){code}
h3. Expected behavior

We expect both the case of the column name and the data type to be preserved. We tried other formats such as Parquet and ORC, and their outcome is consistent with this expectation.

Here are the logs from our attempt at doing the same with Parquet:
{noformat}
spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
Time taken: 0.134 seconds
spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
Time taken: 0.995 seconds
spark-sql> desc hive_tinyint_parquet;
c0                      int
C1                      tinyint  // Data type and case-sensitivity preserved
Time taken: 0.092 seconds, Fetched 2 row(s){noformat}
h3. Root Cause

[TypeInfoToSchema|https://github.com/apache/hive/blob/8190d2be7b7165effa62bd21b7d60ef81fb0e4af/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L41]'s [createAvroPrimitive|https://github.com/apache/hive/blob/rel/release-3.1.2/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L124-L132] is where Hive's BYTE, SHORT & INT are all converted into Avro's INT:
{code:java}
      case BYTE:
        schema = Schema.create(Schema.Type.INT);
        break;
      case SHORT:
        schema = Schema.create(Schema.Type.INT);
        break;
      case INT:
        schema = Schema.create(Schema.Type.INT);
        break;
{code}
 
Once the Hive schema is converted into an Avro schema, the actual Hive schema specified by the user is no longer tracked. Therefore, once TINYINT/BYTE is converted into INT, the narrower type is lost in the AvroSerDe instance.
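To make the information loss concrete, here is a minimal, self-contained model of that switch (plain JDK only; the enum names are hypothetical stand-ins for Hive's PrimitiveCategory and Avro's Schema.Type, not the real classes):

```java
import java.util.EnumMap;
import java.util.Map;

public class LossyMapping {
    // Hypothetical stand-ins for Hive's PrimitiveCategory and Avro's Schema.Type.
    enum HiveType { BYTE, SHORT, INT }
    enum AvroType { INT }

    static AvroType toAvro(HiveType t) {
        switch (t) {
            case BYTE:   // all three integer widths collapse to Avro INT,
            case SHORT:  // mirroring the behavior of createAvroPrimitive
            case INT:
                return AvroType.INT;
            default:
                throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    public static void main(String[] args) {
        Map<HiveType, AvroType> mapping = new EnumMap<>(HiveType.class);
        for (HiveType t : HiveType.values()) {
            mapping.put(t, toAvro(t));
        }
        // The mapping is many-to-one and hence not invertible: given only
        // AvroType.INT, the original width (BYTE vs SHORT vs INT) is gone.
        System.out.println(mapping); // prints {BYTE=INT, SHORT=INT, INT=INT}
    }
}
```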
 


> Column data type is lost when an Avro table with a BYTE column is written through spark-sql
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-26533
>                 URL: https://issues.apache.org/jira/browse/HIVE-26533
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 3.1.2
>            Reporter: xsys
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)