Posted to dev@parquet.apache.org by Mohammad Islam <mi...@yahoo.com.INVALID> on 2015/10/16 20:10:47 UTC

List type to parquet schema

(Sorry for being too specific!)
Hi,
What is a good, consistent way to convert a List type into a Parquet schema?

Let's take an example:


My type is array<struct<latitude:double,longitude:double>>.
For this type, Parquet gives me the following schema:
optional group locations (LIST) {
  repeated group element {
    required double latitude;
    required double longitude;
  }
}

But in Hive, the same type is converted to:
optional group locations (LIST) {
  repeated group bag {
    optional group array_element {
      optional double latitude;
      optional double longitude;
    }
  }
}

It looks like Hive adds an extra layer called "bag". I want both to produce the same schema so that it is easier to compare schemas as the type evolves.
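
Until the writers agree, the comparison itself could ignore the wrapper layer. Below is a minimal sketch of that idea, not anything taken from Hive or parquet-mr: the `Node` record, `elementOf`, `signature`, and the two builder methods are hypothetical names invented for illustration, and the rule for deciding which node is the list element follows my reading of the backward-compatibility rules in the Parquet format's LogicalTypes document. The signature deliberately drops wrapper names and required/optional, so it only compares field names and types. It assumes Java 16+ for records.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ListSchemaNormalizer {

    /** Hypothetical minimal schema node; this is NOT a parquet-mr class. */
    public record Node(String repetition, String type, String name,
                       boolean list, List<Node> fields) {
        public static Node leaf(String repetition, String type, String name) {
            return new Node(repetition, type, name, false, List.of());
        }
        public static Node group(String repetition, String name, boolean list, Node... fields) {
            return new Node(repetition, "group", name, list, List.of(fields));
        }
        public boolean isGroup() {
            return "group".equals(type);
        }
    }

    /** Finds the element node of a LIST-annotated group, whichever layout wrote it. */
    public static Node elementOf(Node listGroup) {
        if (!listGroup.list() || listGroup.fields().size() != 1) {
            throw new IllegalArgumentException("not a LIST group with one child: " + listGroup.name());
        }
        Node repeated = listGroup.fields().get(0);
        if (!"repeated".equals(repeated.repetition())) {
            throw new IllegalArgumentException("child of LIST group is not repeated: " + repeated.name());
        }
        // The repeated node is itself the element when it is a primitive, a group
        // without exactly one field, or a one-field group with a legacy name.
        boolean repeatedIsElement =
                !repeated.isGroup()
                || repeated.fields().size() != 1
                || repeated.name().equals("array")
                || repeated.name().equals(listGroup.name() + "_tuple");
        // Otherwise the repeated group is only a wrapper (for example Hive's "bag").
        return repeatedIsElement ? repeated : repeated.fields().get(0);
    }

    /** Renders a layout-independent signature: names and types only, no nullability. */
    public static String signature(Node node) {
        if (node.list()) {
            return "list<" + signature(elementOf(node)) + ">";
        }
        if (node.isGroup()) {
            return node.fields().stream()
                    .map(f -> f.name() + ":" + signature(f))
                    .collect(Collectors.joining(",", "struct<", ">"));
        }
        return node.type();
    }

    /** The first schema from this message (repeated group "element"). */
    public static Node parquetStyle() {
        return Node.group("optional", "locations", true,
                Node.group("repeated", "element", false,
                        Node.leaf("required", "double", "latitude"),
                        Node.leaf("required", "double", "longitude")));
    }

    /** The Hive schema from this message ("bag" wrapping "array_element"). */
    public static Node hiveStyle() {
        return Node.group("optional", "locations", true,
                Node.group("repeated", "bag", false,
                        Node.group("optional", "array_element", false,
                                Node.leaf("optional", "double", "latitude"),
                                Node.leaf("optional", "double", "longitude"))));
    }

    public static void main(String[] args) {
        String a = signature(parquetStyle());
        String b = signature(hiveStyle());
        System.out.println(a);
        System.out.println(b);
        System.out.println(a.equals(b));
    }
}
```

With the two schemas above, both signatures come out as list<struct<latitude:double,longitude:double>>, so the "bag" layer no longer shows up as a difference. The required/optional difference between the two schemas is hidden as well, which may or may not be what you want when tracking evolution.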

My question is: should we modify one of them (preferably the Hive side)? If so, do you have any suggestions?


The corresponding code in Hive:

// An optional group containing a repeated anonymous group "bag", containing
// 1 anonymous element "array_element"
private static GroupType convertArrayType(final String name, final ListTypeInfo typeInfo) {
  final TypeInfo subType = typeInfo.getListElementTypeInfo();
  return listWrapper(name, OriginalType.LIST, new GroupType(Repetition.REPEATED,
      ParquetHiveSerDe.ARRAY.toString(), convertType("array_element", subType)));
}

Regards,
Mohammad

Source file: org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter
(https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java).
