Posted to issues@spark.apache.org by "Sai kiran Krishna murthy (Jira)" <ji...@apache.org> on 2020/06/29 20:53:00 UTC

[jira] [Commented] (SPARK-32122) Exception while writing dataframe with enum fields

    [ https://issues.apache.org/jira/browse/SPARK-32122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148128#comment-17148128 ] 

Sai kiran Krishna murthy commented on SPARK-32122:
--------------------------------------------------

Looks like the issue is resolved in Spark 3.0.0.

> Exception while writing dataframe with enum fields
> --------------------------------------------------
>
>                 Key: SPARK-32122
>                 URL: https://issues.apache.org/jira/browse/SPARK-32122
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Sai kiran Krishna murthy
>            Priority: Minor
>              Labels: avro, spark-sql
>
> I have an Avro schema with one field that is an enum, and I am trying to enforce this schema when writing my DataFrame. The code looks something like this:
> {code:java}
> case class Name1(id:String,count:Int,val_type:String)
> val schema = """{
>                  |  "type" : "record",
>                  |  "name" : "name1",
>                  |  "namespace" : "com.data",
>                  |  "fields" : [
>                  |  {
>                  |    "name" : "id",
>                  |    "type" : "string"
>                  |  },
>                  |  {
>                  |    "name" : "count",
>                  |    "type" : "int"
>                  |  },
>                  |  {
>                  |    "name" : "val_type",
>                  |    "type" : {
>                  |      "type" : "enum",
>                  |      "name" : "ValType",
>                  |      "symbols" : [ "s1", "s2" ]
>                  |    }
>                  |  }
>                  |  ]
>                  |}""".stripMargin
> val df = Seq(
>             Name1("1",2,"s1"),
>             Name1("1",3,"s2"),
>             Name1("1",4,"s2"),
>             Name1("11",2,"s1")).toDF()
> df.write.format("avro").option("avroSchema",schema).save("data/tes2/")
> {code}
> This code fails with the following exception:
>  
> {noformat}
> 2020-06-28 23:28:10 ERROR Utils:91 - Aborting task
> org.apache.avro.AvroRuntimeException: Not a union: "string"
> 	at org.apache.avro.Schema.getTypes(Schema.java:299)
> 	at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
> 	at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
> 	at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:208)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> 	at scala.collection.immutable.List.map(List.scala:296)
> 	at org.apache.spark.sql.avro.AvroSerializer.newStructConverter(AvroSerializer.scala:208)
> 	at org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:51)
> 	at org.apache.spark.sql.avro.AvroOutputWriter.serializer$lzycompute(AvroOutputWriter.scala:42)
> 	at org.apache.spark.sql.avro.AvroOutputWriter.serializer(AvroOutputWriter.scala:42)
> 	at org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:64)
> 	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:121)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> 2020-06-28 23:28:10 ERROR Utils:91 - Aborting task{noformat}
>  
> I understand this is because the type of val_type is `String` in the case class. Could you please advise how I can solve this problem without having to change the underlying Avro schema?
> Thanks!
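For context on what changed: the 2.4 stack trace above fails inside resolveNullableType, which expects a union schema for the nullable string field and hits the enum instead ("Not a union"). As I understand it, the 3.0 serializer instead accepts a Catalyst string against an Avro enum by validating the value against the enum's symbol set. A minimal standalone sketch of that check (illustrative names only, not Spark's actual internal API):

```scala
// Hedged sketch of the symbol-membership check that Spark 3.0's Avro
// serializer effectively performs when a Catalyst string column is written
// against an Avro enum. The symbols below come from the schema in the report.
val enumSymbols = Set("s1", "s2")

// Returns the value unchanged if it is a legal enum symbol; otherwise the
// write must fail, analogous to an incompatible-schema error in Spark.
def toEnumSymbol(value: String): String =
  if (enumSymbols.contains(value)) value
  else throw new IllegalArgumentException(
    s"Cannot write '$value': not among enum symbols $enumSymbols")

// Every val_type value in the example DataFrame passes this check:
Seq("s1", "s2", "s2", "s1").foreach(v => assert(toEnumSymbol(v) == v))
```

So on 3.0 the write in the report should succeed as long as every val_type value is one of the declared symbols; a value outside the symbol set would still be rejected.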



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org