Posted to issues@spark.apache.org by "Pierre Gramme (Jira)" <ji...@apache.org> on 2020/08/14 15:19:00 UTC
[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names

     [ https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Gramme updated SPARK-32618:
----------------------------------
    Description: 
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 'struct<a:b^:int>'}} when writing a DataFrame to ORC if a column name contains a colon ({{:}}). The code below reproduces it. The same problem occurs when the name containing a colon belongs to a field nested inside a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from parsed XML, so other users are likely to be affected too.

Has this been fixed in a release after Spark 2.3.0? (Sorry, I can't easily test on a newer version.)

Is there a workaround? Replacing every colon with an underscore in the column names would be acceptable for me, but that is not easy to do across a large set of nested struct columns.

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 'struct<a:b^:int>'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColonStruct.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 'struct<x:struct<a:b^:int>>'
{code}
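
For the record, one possible workaround for the nested case, as a rough sketch only: rebuild the schema recursively with every colon in a field name replaced by an underscore, then re-apply that schema to the same rows. The helper names {{sanitize}} and {{withoutColons}} are made up for illustration and are not part of Spark, and this has not been checked against every Spark version. Note that two names differing only by colon versus underscore (e.g. {{a:b}} and {{a_b}}) would collide after renaming.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper: walk a data type and replace ':' with '_' in every
// struct field name, descending into nested structs, arrays and maps.
def sanitize(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map { f =>
      f.copy(name = f.name.replace(':', '_'), dataType = sanitize(f.dataType))
    })
  case a: ArrayType => a.copy(elementType = sanitize(a.elementType))
  case m: MapType =>
    m.copy(keyType = sanitize(m.keyType), valueType = sanitize(m.valueType))
  case other => other // primitive types carry no field names
}

// Hypothetical helper: same rows, renamed schema. Field order and types are
// unchanged, so the rows are matched to the new schema by position.
def withoutColons(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, sanitize(df.schema).asInstanceOf[StructType])
{code}

With the repro above, {{withoutColons(spark, dfColonStruct).write.orc("test_colon_struct_clean")}} should avoid the parse error, since no field name contains a colon any more. The round trip through {{df.rdd}} has a cost, so this is only meant as a stopgap.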
 

 

  was:
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 'struct<a:b^:int>'}} when writing a DataFrame to ORC if a column name contains a colon ({{:}}). The code below reproduces it. The same problem occurs when the name containing a colon belongs to a field nested inside a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from parsed XML, so other users are likely to be affected too.

Has this been fixed in a release after Spark 2.3.0? (Sorry, I can't easily test on a newer version.)

Is there a workaround? Replacing every colon with an underscore in the column names would be acceptable for me, but that is not easy to do across a large set of nested struct columns.

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 'struct<a:b^:int>'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 'struct<a:b^:int>'
{code}
 

 


> ORC writer doesn't support colon in column names
> ------------------------------------------------
>
>                 Key: SPARK-32618
>                 URL: https://issues.apache.org/jira/browse/SPARK-32618
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.0
>            Reporter: Pierre Gramme
>            Priority: Major


