You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@avro.apache.org by "Evan Blackwell (Jira)" <ji...@apache.org> on 2023/06/14 23:35:00 UTC

[jira] [Reopened] (AVRO-3772) [Rust] Deserialize Errors for an Unknown Enum Symbol instead of Returning Default

     [ https://issues.apache.org/jira/browse/AVRO-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evan Blackwell reopened AVRO-3772:
----------------------------------

It looks like the test cases I gave did pass, but I was doing some reading on defaults in avro and am now wondering if I had the default in the right spot. It looks like there's a difference between a field default and a symbol default according to this article [https://medium.com/expedia-group-tech/safety-considerations-when-using-enums-in-avro-schemas-82e18baaa081] 

 

If the reader schema looked like this, I think the symbol default would be diamonds and the field default would be spades. It seems the symbol default is intended for use with forward compatibility if we read a symbol we don't recognize (e.g. if we read "ninjas," we would resolve to "diamonds"), and the field default is for backwards compatibility if we read a message doesn't have the field at all (e.g. if an old schema doesn't have field c, we default to "spades").
{code:java}
{
            "type": "record",
            "name": "test",
            "fields": [
                {"name": "a", "type": "long", "default": 42},
                {"name": "b", "type": "string"},
                {
                    "name": "c",
                    "type": {
                        "type": "enum",
                        "name": "suit",
                        "symbols": ["diamonds", "spades", "ninja", "hearts"],
                        "default": "diamonds"
                    },
                    "default": "spades"
                }
            ]
        } {code}
It seems like I might have asked you to make the field default operate the way that the symbol default is supposed to be used. Maybe this actually should have been the test
{code:java}
    #[test]
    fn avro_3772_enum_default() -> TestResult {
        let writer_raw_schema = r#"
        {
          "type": "record",
          "name": "test",
          "fields": [
            {"name": "a", "type": "long", "default": 42},
            {"name": "b", "type": "string"},
            {
              "name": "c",
              "type": {
                "type": "enum",
                "name": "suit",
                "symbols": ["diamonds", "spades", "clubs", "hearts"]
              },
              "default": "spades"
            }
          ]
        }
        "#;        let reader_raw_schema = r#"
        {
          "type": "record",
          "name": "test",
          "fields": [
            {"name": "a", "type": "long", "default": 42},
            {"name": "b", "type": "string"},
            {
              "name": "c",
              "type": {
                  "type": "enum",
                  "name": "suit",
                  "symbols": ["diamonds", "spades", "ninja", "hearts"],
                  "default": "spades"
               }
            }
          ]
        }
      "#;
        let writer_schema = Schema::parse_str(writer_raw_schema)?;
        let reader_schema = Schema::parse_str(reader_raw_schema)?;
        let mut writer = Writer::with_codec(&writer_schema, Vec::new(), Codec::Null);
        let mut record = Record::new(writer.schema()).unwrap();
        record.put("a", 27i64);
        record.put("b", "foo");
        record.put("c", "clubs");
        writer.append(record).unwrap();
        let input = writer.into_inner()?;
        let mut reader = Reader::with_schema(&reader_schema, &input[..])?;
        assert_eq!(
            reader.next().unwrap().unwrap(),
            Value::Record(vec![
                ("a".to_string(), Value::Long(27)),
                ("b".to_string(), Value::String("foo".to_string())),
                ("c".to_string(), Value::Enum(1, "spades".to_string())),
            ])
        );
        assert!(reader.next().is_none());
        Ok(())
    } {code}

> [Rust] Deserialize Errors for an Unknown Enum Symbol instead of Returning Default
> ---------------------------------------------------------------------------------
>
>                 Key: AVRO-3772
>                 URL: https://issues.apache.org/jira/browse/AVRO-3772
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: rust
>    Affects Versions: 1.9.0, 1.10.0, 1.9.1, 1.9.2, 1.11.0, 1.10.1, 1.10.2, 1.11.1
>            Reporter: Evan Blackwell
>            Assignee: Martin Tzvetanov Grigorov
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0, 1.11.2
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The rust implementation of avro appears to behave according to an old specification when deserializing messages with a symbol in the writer's enum that is not in the reader's enum. The deserializer is returning an error even if the reader schema specifies a default. Starting in version 1.9.0, the default should be used in this situation according to this section of the documentation about schema resolution: 
> [https://avro.apache.org/docs/current/spec.html#Schema+Resolution]
> {code:java}
> if both are enums:
> if the writer's symbol is not present in the reader's enum and the reader has a default value, then that value is used, otherwise an error is signalled. {code}
>  
> This test currently in master expects the deserializer to error because it was written 5 years ago before spec 1.9.0 was released. (summary of the test: writer has symbol "clubs," the reader does not have the "clubs" symbol but does have a default, record gets made with symbol "clubs," the result is an error)
> [https://github.com/apache/avro/blob/6f4162e3d71e602bc398563b102d569846d5f39f/lang/rust/avro/src/lib.rs#L871]
>  
> {code:java}
> #[test]
> fn test_enum_resolution() {
>     let writer_raw_schema = r#"
>         {
>             "type": "record",
>             "name": "test",
>             "fields": [
>                 {"name": "a", "type": "long", "default": 42},
>                 {"name": "b", "type": "string"},
>                 {
>                     "name": "c",
>                     "type": {
>                         "type": "enum",
>                         "name": "suit",
>                         "symbols": ["diamonds", "spades", "clubs", "hearts"]
>                     },
>                     "default": "spades"
>                 }
>             ]
>         }
>     "#;
>     let reader_raw_schema = r#"
>         {
>             "type": "record",
>             "name": "test",
>             "fields": [
>                 {"name": "a", "type": "long", "default": 42},
>                 {"name": "b", "type": "string"},
>                 {
>                     "name": "c",
>                     "type": {
>                         "type": "enum",
>                         "name": "suit",
>                             "symbols": ["diamonds", "spades", "ninja", "hearts"]
>                     },
>                     "default": "spades"
>                 }
>             ]
>         }
>     "#;
>     let writer_schema = Schema::parse_str(writer_raw_schema).unwrap();
>     let reader_schema = Schema::parse_str(reader_raw_schema).unwrap();
>     let mut writer = Writer::with_codec(&writer_schema, Vec::new(), Codec::Null);
>     let mut record = Record::new(writer.schema()).unwrap();
>     record.put("a", 27i64);
>     record.put("b", "foo");
>     record.put("c", "clubs");
>     writer.append(record).unwrap();
>     let input = writer.into_inner().unwrap();
>     let mut reader = Reader::with_schema(&reader_schema, &input[..]).unwrap();
>     assert!(reader.next().unwrap().is_err());
>     assert!(reader.next().is_none());
> } {code}
>  
>  
> If the deserializer was correctly using the default, I would expect the last two lines to instead assert the first two values were as expected with c defaulting to "spades"
>  
> {code:java}
> assert_eq!(
>     reader.next().unwrap().unwrap(),
>     Value::Record(vec![
>         ("a".to_string(), Value::Long(27)),
>         ("b".to_string(), Value::String("foo".to_string())),
>         ("c".to_string(), Value::Enum(1, "spades".to_string())),
>     ])
> );
> assert!(reader.next().is_none()); {code}
>  
> The error returned is GetDefaultEnum, and it seems to be getting created here when we recognize that the symbol is not in the list of symbols that the reader has for the enum. 
> [https://github.com/apache/avro/blob/6f4162e3d71e602bc398563b102d569846d5f39f/lang/rust/avro/src/types.rs#L848]
>  
> {code:java}
> fn resolve_enum(self, symbols: &[String]) -> Result<Self, Error> {
>         let validate_symbol = |symbol: String, symbols: &[String]| {
>             if let Some(index) = symbols.iter().position(|item| item == &symbol) {
>                 Ok(Value::Enum(index as u32, symbol))
>             } else {
>                 Err(Error::GetEnumDefault {
>                     symbol,
>                     symbols: symbols.into(),
>                 })
>             }
>         };
>         match self {
>             Value::Enum(raw_index, s) => {
>                 let index = usize::try_from(raw_index)
>                     .map_err(|e| Error::ConvertU32ToUsize(e, raw_index))?;
>                 if (0..=symbols.len()).contains(&index) {
>                     validate_symbol(s, symbols)
>                 } else {
>                     Err(Error::GetEnumValue {
>                         index,
>                         nsymbols: symbols.len(),
>                     })
>                 }
>             }
>             Value::String(s) => validate_symbol(s, symbols),
>             other => Err(Error::GetEnum(other.into())),
>         }
>     } {code}
>  
> To follow the specification, it seems that instead of always returning the GetEnumDefault error if the symbol doesn't exist in the reader enum, we should first check if there is a default to use and return it if possible. Or the caller should check for this error and use the default if available.
>  
> I'm relatively new to avro, so if I'm misunderstanding either the specification or the code behavior please let me know. Thanks!
>  
> (I'm also curious if there would be another issue if the record contained a symbol that exists in the writer enum and had an index greater than the len of the reader's symbols since it appears to be doing a check for that on line 859).
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)