Posted to user@spark.apache.org by abhijeet bedagkar <qa...@gmail.com> on 2018/05/16 12:43:21 UTC
Dataframe save-as-table operation is failing when child column
names contain special characters
Hi,
I am using Spark to read XML / JSON files into a DataFrame and
save it as a Hive table.
Sample XML file:
<revolt_configuration>
<id>101</id>
<testexecutioncontroller>
<execution-timeout>45</execution-timeout>
<execution-method>COMMAND</execution-method>
</testexecutioncontroller>
</revolt_configuration>
Note the field 'execution-timeout' under testexecutioncontroller.
Below is the schema the DataFrame infers after reading the XML file:
|-- id: long (nullable = true)
|-- testexecutioncontroller: struct (nullable = true)
| |-- execution-timeout: long (nullable = true)
| |-- execution-method: string (nullable = true)
While saving this DataFrame as a Hive table I get the exception below:
Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 24 of 'bigint:struct<execution-timeout:bigint,execution-method:string>' but '-' is found.
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
    at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:111)
    at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
    at org.apache
It looks like the issue is caused by the special character '-' in the
field name, since the save works properly after removing it.
Could you please let me know if there is a way to rename all child
columns so that the DataFrame can be saved as a table without this issue?
One solution I am aware of is to walk df.schema and recursively build a
new StructType whose StructFields carry renamed columns, but I wanted to
check whether there is an easier way.
Thanks,
Abhijeet
Re: Dataframe save-as-table operation is failing when child column
names contain special characters
Posted by abhijeet bedagkar <qa...@gmail.com>.
I dug further into this issue.
1. The problem seems to originate in the Hive metastore: a query on a
sub-column containing special characters failed for me even with
backticks added.
2. I worked around it by explicitly passing a SQL cast expression to the
DataFrame that renames the special characters in the sub-columns.
Example source data (JSON):
{
"address": {
"lane-one": "mark street",
"lane:two": "sub street"
}
}
Python code:
from pyspark.sql.functions import col

schema = 'struct<lane_one:string,lane_two:string>'
df = data_frame_from_json.select(col('address').cast(schema))
I have verified this for much more complex JSON and XML structures and
the data looks good.
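Hand-writing that cast string gets tedious for deeper structures, so here is a small helper that renders a `df.schema.jsonValue()` type node as a DDL type string with sanitized field names. `to_cast_ddl` is my own helper name, the '-'/':'-to-'_' rule is an assumption, and atomic type names are passed through as-is (Spark's DDL parser accepts names like 'string' and 'bigint'); the sample type node below is hypothetical:

```python
import re

def to_cast_ddl(dtype):
    """Render a df.schema.jsonValue() type node as a DDL type string,
    e.g. 'struct<lane_one:string,lane_two:string>', replacing special
    characters in field names with underscores along the way."""
    if isinstance(dtype, dict) and dtype.get('type') == 'struct':
        parts = []
        for field in dtype['fields']:
            name = re.sub(r'[^0-9a-zA-Z_]', '_', field['name'])
            parts.append('%s:%s' % (name, to_cast_ddl(field['type'])))
        return 'struct<%s>' % ','.join(parts)
    if isinstance(dtype, dict) and dtype.get('type') == 'array':
        return 'array<%s>' % to_cast_ddl(dtype['elementType'])
    return dtype  # atomic type name such as 'string' or 'bigint'

# Hypothetical type node for the 'address' column above:
address_type = {'type': 'struct',
                'fields': [
                    {'name': 'lane-one', 'type': 'string',
                     'nullable': True, 'metadata': {}},
                    {'name': 'lane:two', 'type': 'string',
                     'nullable': True, 'metadata': {}}]}
ddl = to_cast_ddl(address_type)
# Then, with Spark: df.select(col('address').cast(ddl))
```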
Thanks,
Abhijeet