Posted to issues@spark.apache.org by "Terry Moschou (Jira)" <ji...@apache.org> on 2019/10/11 04:32:00 UTC
[jira] [Commented] (SPARK-28008) Default values & column comments in AVRO schema converters

    [ https://issues.apache.org/jira/browse/SPARK-28008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949125#comment-16949125 ] 

Terry Moschou commented on SPARK-28008:
---------------------------------------

We also have a use case for propagating application-specific metadata other than {{comment}}, which {{SchemaConverters}} currently drops. The Avro [specification|http://avro.apache.org/docs/current/spec.html#schemas] permits user-defined attributes, as long as their names are not reserved:

bq. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

So something like a {{"metadata"}} key would work. I guess {{doc}} could be a shortcut for {{metadata.comment}}?

{code:json}
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "a",
      "type": "string",
      "metadata": {
        "comment": "AAAAAAA",
        "foo": "bar"
      }
    }
  ]
}
{code}
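To illustrate the reverse direction ({{toSqlType}}), here is a rough sketch of how such a {{"metadata"}} attribute might be copied into Spark column metadata. It is not how {{SchemaConverters}} works today. {{AvroFieldMetadata}} and {{toSparkMetadata}} are hypothetical names, only string-valued entries are handled, and letting {{metadata.comment}} override {{doc}} is an assumption. It assumes the Avro (1.8 or later) and Spark SQL jars are on the classpath, for example in {{spark-shell}} with the spark-avro package.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark.
object AvroFieldMetadata {
  def toSparkMetadata(field: Schema.Field): Metadata = {
    val builder = new MetadataBuilder()

    // Treat Avro's reserved "doc" attribute as a shortcut for the comment.
    Option(field.doc()).foreach(builder.putString("comment", _))

    // "metadata" is a user-defined attribute; Avro keeps it as a field property.
    field.getObjectProp("metadata") match {
      case m: java.util.Map[_, _] =>
        m.asScala.foreach {
          // Assumption: only string values are copied; later puts override "doc".
          case (k: String, v: String) => builder.putString(k, v)
          case _ => // nested or non-string values are skipped in this sketch
        }
      case _ => // attribute absent or not a JSON object
    }
    builder.build()
  }
}
```

Applied to field {{a}} of the schema above, this should yield metadata with {{comment}} set to {{AAAAAAA}} and {{foo}} set to {{bar}}.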

> Default values & column comments in AVRO schema converters
> ----------------------------------------------------------
>
>                 Key: SPARK-28008
>                 URL: https://issues.apache.org/jira/browse/SPARK-28008
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Mathew Wicks
>            Priority: Major
>
> Currently in both `toAvroType` and `toSqlType` [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] there are two behaviours which are unexpected.
> h2. Nullable fields in spark are converted to UNION[TYPE, NULL] and no default value is set:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : [ "string", "null" ]
>   } ]
> }
> {code}
> *Expected Behaviour:*
> (NOTE: The reversal of "null" & "string" in the union, needed for a default value of null)
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : [ "null", "string" ],
>     "default" : null
>   } ]
> }{code}
> h2. Field comments/metadata is not propagated:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : "string"
>   } ]
> }{code}
> *Expected Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : "string",
>     "doc" : "AAAAAAA"
>   } ]
> }{code}
>  
> The behaviour should be similar (but the reverse) for `toSqlType`.
> I think we should aim to get this in before 3.0, as it will probably be a breaking change for some usage of the AVRO API.
