Posted to issues@orc.apache.org by "Oleksiy Sayankin (JIRA)" <ji...@apache.org> on 2019/05/20 11:41:00 UTC

[jira] [Comment Edited] (ORC-502) Hive ORC read INT, BIGINT as NULL for Data created by Spark

    [ https://issues.apache.org/jira/browse/ORC-502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843882#comment-16843882 ] 

Oleksiy Sayankin edited comment on ORC-502 at 5/20/19 11:40 AM:
----------------------------------------------------------------

*ROOT-CAUSE*

The root cause of the issue is that the column names in the original CSV file do not match the column names in the Hive table used for the ORC data. Note that Hive does not distinguish case in column names; e.g. the column name {{userId}} is transformed to {{userid}} by Hive internally. See the table below for details.
\\
\\
Table: comparison of column names
||Column name in CSV file||Column name in ORC file||Column name in Hive ORC table in {{CREATE}} statement||Real column name in Hive||
|(1)|(2)|(3)|(4)|
|{{userId}}|{{userId}}|{{userId}}|{{userid}}|
|{{movieId}}|{{movieId}}|{{movieId}}|{{movieid}}|
|{{rating}}|{{rating}}|{{rating}}|{{rating}}|
|{{timestamp}}|{{timestamp}}|{{timestamp}}|{{timestamp}}|
\\
So the real column name in Hive (column (4)) must exactly match the column name in the CSV (column (1)). This happens only for the columns {{rating}} and {{timestamp}}. As you can see, {{userId}} does not equal {{userid}}, and {{movieId}} does not equal {{movieid}}.
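The failure mode can be illustrated with plain string comparison: the file schema keeps Spark's mixed-case names, while Hive looks them up with lower-cased names. A minimal sketch (the class and helper names are hypothetical, not ORC code):

{code}
import java.util.Arrays;
import java.util.List;

public class NameMatchDemo {
  // Field names as written into the ORC file by Spark (mixed case)
  static final List<String> FILE_FIELDS =
      Arrays.asList("userId", "movieId", "rating", "timestamp");

  // Index of the reader column in the file schema, or -1 if no match
  static int findFileColumn(String readerField) {
    return FILE_FIELDS.indexOf(readerField);
  }

  public static void main(String[] args) {
    // Hive lower-cases column names internally, so it looks up "userid"
    System.out.println(findFileColumn("userid"));  // -1: no match, column reads as NULL
    System.out.println(findFileColumn("rating"));  // 2: names agree, column is read
  }
}
{code}

An unmatched reader column is treated as missing from the file, which is exactly why the query returns NULL for {{userId}} and {{movieId}}.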


*SOLUTION*

This is an Apache ORC project issue. The solution is to add column-name normalization to {{org.apache.orc.impl.SchemaEvolution}}:

{code}
private static List<String> normalize(List<String> fields) {
  // Lower-case every field name so that matching is case-insensitive
  List<String> normalized = new ArrayList<>();
  for (String field : fields) {
    normalized.add(field.toLowerCase());
  }
  return normalized;
}
{code}

This code converts all column names to lower case during execution of the method

{code}
void buildConversion(TypeDescription fileType, TypeDescription readerType, int positionalLevels)
{code}

in lines

{code}
List<String> readerFieldNames = normalize(readerType.getFieldNames());
List<String> fileFieldNames = normalize(fileType.getFieldNames());
{code}

The {{buildConversion()}} method builds the mapping from the file type to the reader type. For pre-HIVE-4243 ORC files, the top-level structure is matched by position within the row; otherwise, struct fields are matched by name. Hence we apply {{toLowerCase()}} to every field name.
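A simplified sketch of how the normalized names restore the by-name match (the {{buildMapping()}} helper is hypothetical; the real {{buildConversion()}} works on {{TypeDescription}} trees):

{code}
import java.util.ArrayList;
import java.util.List;

public class SchemaMatchSketch {
  // Mirrors the proposed normalize() method
  static List<String> normalize(List<String> fields) {
    List<String> normalized = new ArrayList<>();
    for (String field : fields) {
      normalized.add(field.toLowerCase());
    }
    return normalized;
  }

  // Match reader fields to file fields by normalized name;
  // -1 means "missing in the file", which is read as NULL
  static int[] buildMapping(List<String> readerFields, List<String> fileFields) {
    List<String> reader = normalize(readerFields);
    List<String> file = normalize(fileFields);
    int[] mapping = new int[reader.size()];
    for (int i = 0; i < reader.size(); i++) {
      mapping[i] = file.indexOf(reader.get(i));
    }
    return mapping;
  }
}
{code}

With normalization, {{userid}} from Hive now matches {{userId}} from the file, so every column of the example table maps successfully instead of only {{rating}} and {{timestamp}}.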

Until there is a fix in Apache ORC, I have created a copy of the {{org.apache.orc.impl.SchemaEvolution}} class with the fix applied and put it in the Apache Hive project.

*EFFECTS*

1. ORC Files.
2. ACID transactions.


> Hive ORC read INT, BIGINT as NULL for Data created by Spark
> -----------------------------------------------------------
>
>                 Key: ORC-502
>                 URL: https://issues.apache.org/jira/browse/ORC-502
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Oleksiy Sayankin
>            Priority: Major
>         Attachments: data.orc
>
>
> *Preconditions*
> Create file {{ratings.csv}} and put it to HDFS {{/user/test/rating/ratings.csv}}.
> {code}
> userId,movieId,rating,timestamp
> 1,2,4.5,1784325658
> {code}
> See appropriate {{data.orc}} file in attachment.
> *STR:*
> 1. Using Spark (tested on versions 2.2.1 and 2.3.1), create a DataFrame {{df}} from the CSV file using {{inferSchema}}
> {code}
> val df =spark.read.format("csv").option("header","true").option("inferSchema","true").load("/user/test/rating/ratings.csv")
> {code}
> 2. Save {{df}} as an ORC file.
> {code}
> df.write.format("orc").save("/user/test/spark_rating_orc_typesafe")
> {code}
> 3. Using Hive 2.3, create the corresponding Hive external table.
> {code}
> create external table rating_orc_hive_type_1(userId int,movieId int,rating double, `timestamp` int) stored as ORC location "/user/test/spark_orc_rating_typesafe/";
> {code}
> 4. Run the query
> {code}
> select * from rating_orc_hive_type_1;
> {code}
> Only the double value is read correctly; NULL is returned for the INT columns (and likewise for BIGINT).
> {code}
> OK
> NULL    NULL    4.5     1784325658
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)