Posted to dev@avro.apache.org by "Scott (Jira)" <ji...@apache.org> on 2021/01/21 03:58:00 UTC
[jira] [Created] (AVRO-3029) Specification is a little ambiguous about where enum defaults should be defined which might be causing library differences

Scott created AVRO-3029:
---------------------------

             Summary: Specification is a little ambiguous about where enum defaults should be defined which might be causing library differences
                 Key: AVRO-3029
                 URL: https://issues.apache.org/jira/browse/AVRO-3029
             Project: Apache Avro
          Issue Type: Improvement
          Components: java, python, ruby
    Affects Versions: 1.10.1
            Reporter: Scott


In the specification, an enum type can have a `default` attribute. At the same time, each field in a record can have a default. On top of that, the table of example default values for record fields includes an entry for enum along with an example value.

So, if I want to define a record with an enum field, where would I put the default? Do I define it like this:
{code:java}
{
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"]
            },
            "default": "FOO"
        }
    ]
}
{code}

Or like this:
{code:java}
{
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
                "default": "FOO"
            }
        }
    ]
}
{code}
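
To make the difference between the two placements concrete, here is a small plain-Python sketch that reports where each enum-typed field declares its default. It does not use any Avro library; `default_locations` is a hypothetical helper name chosen for illustration, and it only inspects enums defined inline in a field.

```python
def default_locations(record_schema):
    """Map each inline-enum field name to the places its default is declared."""
    locations = {}
    for field in record_schema.get("fields", []):
        field_type = field.get("type")
        # Only inline enum definitions (dicts with "type": "enum") are inspected.
        if not (isinstance(field_type, dict) and field_type.get("type") == "enum"):
            continue
        found = []
        if "default" in field:
            found.append("field")  # default sits next to "name" and "type"
        if "default" in field_type:
            found.append("enum")   # default sits inside the enum definition
        locations[field["name"]] = found
    return locations


# The two candidate schemas from above, written as Python dicts.
field_level = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "enum",
            "type": {"type": "enum", "name": "enum_field", "symbols": ["FOO", "BAR"]},
            "default": "FOO",
        }
    ],
}

enum_level = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
                "default": "FOO",
            },
        }
    ],
}

print(default_locations(field_level))  # {'enum': ['field']}
print(default_locations(enum_level))   # {'enum': ['enum']}
```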

I was confused, so I started looking for examples, and it seems I'm not the only one who is confused about this: [this stackoverflow|https://stackoverflow.com/questions/62596990/avro-schema-evolution-with-enum-deserialization-crashes] question and this Jira ticket put the default at the field level, whereas this Jira ticket puts the default at the enum level.

So then I started looking at examples in the codebase. There are a [Ruby test case|https://github.com/apache/avro/blob/7d1e63b219e6d0778bc57195152477adee97fcab/lang/ruby/test/test_schema.rb#L333-L338] and a [Java test case|https://github.com/apache/avro/blob/7d1e63b219e6d0778bc57195152477adee97fcab/lang/java/avro/src/test/java/org/apache/avro/FooBarSpecificRecord.java#L34] that put the default at the enum level.

Okay, solved, right? Since the test cases have the default at the enum level, that's where it should be... but then I wrote a simple Python script (since I'm a Python user) to double-check this, and it seems like the Python library disagrees. Here's the example script that uses the default at the enum level:
{code:java}
import json
from io import BytesIO
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

writer_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "foo",
            "type": "string"
        }
    ],
}

reader_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "foo",
            "type": "string"
        },
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
                "default": "FOO",
            },
        },
    ],
}

w_schema = avro.schema.parse(json.dumps(writer_schema))
r_schema = avro.schema.parse(json.dumps(reader_schema))

bio = BytesIO()

writer = DataFileWriter(bio, DatumWriter(), w_schema)
writer.append({"foo": "bar"})
writer.flush()

bio.seek(0)

reader = DataFileReader(bio, DatumReader(w_schema, r_schema))
for record in reader:
    print(record)
{code}
But when I run that, I get an exception:
{code:java}
avro.io.SchemaResolutionException: No default value for field enum
Writer's Schema: {
  "type": "record",
  "name": "test",
  "fields": [
    {
      "type": "string",
      "name": "foo"
    }
  ]
}
Reader's Schema: {
  "type": "record",
  "name": "test",
  "fields": [
    {
      "type": "string",
      "name": "foo"
    },
    {
      "type": {
        "type": "enum",
        "default": "FOO",
        "name": "enum_field",
        "symbols": [
          "FOO",
          "BAR"
        ]
      },
      "name": "enum"
    }
  ]
}
{code}
And if I change the script to use a reader_schema that has the default on the field level like this:
{code:java}
reader_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {
            "name": "foo",
            "type": "string"
        },
        {
            "name": "enum",
            "type": {
                "type": "enum",
                "name": "enum_field",
                "symbols": ["FOO", "BAR"],
            },
            "default": "FOO",
        },
    ],
}
{code}
Then it works and prints out the record with the default value for the enum:
{code:java}
{'foo': 'bar', 'enum': 'FOO'}
{code}

I don't have a Java environment set up to run the same kind of script and verify that implementation, but based on the test case I would assume that it works the opposite way and expects the default at the enum level.

I think making the libraries consistent could cause massive breakage for whichever library doesn't currently conform to what the specification intends (and I'm honestly not sure which one that is, given how the spec is currently written). Therefore, I think it might be easiest to allow an enum's default to be defined at either the field level or the enum level. I maintain the `fastavro` library, whose behavior is the same as the avro Python implementation, and I would hate to force a massive breaking change on its users if the specification were updated to say that enum default values have to be defined at the enum level rather than the field level.
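
As a rough sketch of the "accept either location" idea, the lookup could check the field first and fall back to the enum definition. This is plain Python and is not the actual avro or fastavro code; `resolve_field_default` is a hypothetical helper, and giving the field-level default precedence when both are present is an assumption made for this example.

```python
def resolve_field_default(field):
    """Return a default for a record field, accepting either placement.

    Assumption for this sketch: a field-level default wins if both exist.
    """
    if "default" in field:
        return field["default"]
    field_type = field.get("type")
    # Fall back to a default declared inside an inline enum definition.
    if isinstance(field_type, dict) and field_type.get("type") == "enum":
        if "default" in field_type:
            return field_type["default"]
    raise LookupError("No default value for field %s" % field["name"])


enum_level_field = {
    "name": "enum",
    "type": {
        "type": "enum",
        "name": "enum_field",
        "symbols": ["FOO", "BAR"],
        "default": "FOO",
    },
}

field_level_field = {
    "name": "enum",
    "type": {"type": "enum", "name": "enum_field", "symbols": ["FOO", "BAR"]},
    "default": "FOO",
}

print(resolve_field_default(enum_level_field))   # FOO
print(resolve_field_default(field_level_field))  # FOO
```

With a lookup like this, both reader schemas from the script above would resolve the missing field to `FOO`, while a field with no default in either place would still be reported as an error.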

Please let me know your thoughts and thank you for taking the time to read this lengthy message.
