Posted to user@avro.apache.org by Edgar H <ka...@gmail.com> on 2019/08/06 08:47:33 UTC

Avro schema having Map of Records

I'm trying to translate a schema that I have in Spark which is defined for
Parquet, and I would like to use it within Avro too.

  StructField("one_level", StructType(List(StructField(
    "inner_level",
    MapType(
      StringType,
      StructType(
        List(
          StructField("field1", StringType),
          StructField("field2", ArrayType(StringType))
        )
      )
    )
  )
)), nullable = false)

However, I haven't seen any Avro examples of maps whose values are
records.

I tried an online Avro schema generator with this sample input:

{
    "one_level": {
        "inner_level": {
            "sample1": {
                "field1": "sample",
                "field2": ["a", "b"]
            },
            "sample2": {
                "field1": "sample2",
                "field2": ["a", "b"]
            }
        }
    }
}

It produces this output:

{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "one_level",
      "type": {
        "name": "one_level",
        "type": "record",
        "fields": [
          {
            "name": "inner_level",
            "type": {
              "name": "inner_level",
              "type": "record",
              "fields": [
                {
                  "name": "sample1",
                  "type": {
                    "name": "sample1",
                    "type": "record",
                    "fields": [
                      {
                        "name": "field1",
                        "type": "string"
                      },
                      {
                        "name": "field2",
                        "type": {
                          "type": "array",
                          "items": "string"
                        }
                      }
                    ]
                  }
                },
                {
                  "name": "sample2",
                  "type": {
                    "name": "sample2",
                    "type": "record",
                    "fields": [
                      {
                        "name": "field1",
                        "type": "string"
                      },
                      {
                        "name": "field2",
                        "type": {
                          "type": "array",
                          "items": "string"
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }
  ]
}

This isn't what I'm looking for at all. Is it possible to define such a
schema in Avro?

Re: Avro schema having Map of Records

Posted by Edgar H <ka...@gmail.com>.
Seems like the right time to share some Parquet vs Avro knowledge haha :)

My god, exactly what you said! It was an untyped (raw) List within a POJO;
problem solved. BTW, I was calling ReflectData.getSchema().

Thanks a lot Ryan! Really appreciated!

Re: Avro schema having Map of Records

Posted by Ryan Skraba <ry...@skraba.com>.
Funny, I'm familiar with Avro, but I'm currently looking closely at Parquet!

Interestingly enough, I just ran across the conversion utilities in
Spark that could have answered your original question[1].

It looks like you're using ReflectData to get the schema.  Is the
exception occurring during ReflectData.getSchema() or .induce()?
Can you share the full stack trace or, better yet, the POJO that
reproduces the error?

I _think_ I may have run across something similar when getting a
schema via reflection, where the class had a raw collection field (List
instead of List<MyValue>).  I can't clearly recall, but that might be
a useful hint.
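
A minimal sketch of why a raw collection field can trip up reflection-based
schema induction. It uses only java.lang.reflect, not Avro itself, and the
names RawCollectionDemo, RawSample, TypedSample and elementTypeOf are
hypothetical. The assumption is that a reflection-based schema builder needs
the generic type argument to choose the array's item type, and a raw List
does not carry one.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical demo class; it only inspects fields and does not call Avro.
class RawCollectionDemo {

    // Raw collection: no element type is recorded for reflection to find.
    @SuppressWarnings("rawtypes")
    static class RawSample {
        String field1;
        List field2;
    }

    // Parameterized collection: the element type (String) is visible to reflection.
    static class TypedSample {
        String field1;
        List<String> field2;
    }

    // Returns the first generic type argument of the named field,
    // or null when the field is declared without type arguments.
    static Type elementTypeOf(Class<?> owner, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            Type generic = field.getGenericType();
            if (generic instanceof ParameterizedType) {
                return ((ParameterizedType) generic).getActualTypeArguments()[0];
            }
            return null;
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No field " + fieldName + " on " + owner, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("raw:   " + elementTypeOf(RawSample.class, "field2"));
        System.out.println("typed: " + elementTypeOf(TypedSample.class, "field2"));
    }
}
```

With Avro on the classpath, declaring the field as List<String> (or
List<MyValue>) rather than a raw List is what should let ReflectData resolve
the element type.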

[1]: https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L136

Re: Avro schema having Map of Records

Posted by Edgar H <ka...@gmail.com>.
Thanks a lot for the quick reply Ryan! That was exactly what I was looking
for :)

I've been trying to include the changes in my code, and it currently throws
the following exception:

Caused by: org.apache.avro.AvroRuntimeException: Can't find element type of Collection

Could it be that the POJO doesn't contain classes for the inner record
fields? I only have a getter and setter for the one_level field, and the
rest are types nested under it. If so, how should they be represented
within the parent POJO?
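
One possible way to lay out the POJOs, as a sketch rather than a definitive
answer: one class per record, a Map<String, Sample> for the map of records,
and a typed List<String> for the array. The class names PojoLayoutDemo,
MyClass, OneLevel and Sample and the helper sampleData are hypothetical. The
field names follow the Spark schema, and the values come from the sample
input in the original question. Only the JDK is used here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical POJO layout; nothing here depends on Avro.
class PojoLayoutDemo {

    // Value type of the map: one instance per map entry.
    static class Sample {
        String field1;
        List<String> field2; // typed list, not a raw List

        Sample() {} // no-arg constructor, often needed by reflection-based tools

        Sample(String field1, List<String> field2) {
            this.field1 = field1;
            this.field2 = field2;
        }
    }

    // Holds the map of records; the map keys are strings.
    static class OneLevel {
        Map<String, Sample> inner_level = new LinkedHashMap<>();
    }

    // Top-level record.
    static class MyClass {
        OneLevel one_level = new OneLevel();
    }

    // Builds the data from the sample input in the original question.
    static MyClass sampleData() {
        MyClass root = new MyClass();
        root.one_level.inner_level.put("sample1", new Sample("sample", Arrays.asList("a", "b")));
        root.one_level.inner_level.put("sample2", new Sample("sample2", Arrays.asList("a", "b")));
        return root;
    }

    public static void main(String[] args) {
        MyClass root = sampleData();
        for (Map.Entry<String, Sample> entry : root.one_level.inner_level.entrySet()) {
            Sample value = entry.getValue();
            System.out.println(entry.getKey() + " -> " + value.field1 + " " + value.field2);
        }
    }
}
```

With Avro on the classpath, a class shaped like MyClass is what one would
pass to ReflectData.get().getSchema(...) to obtain the schema by reflection.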

Sorry if the questions sound too simple, but I'm so used to working with
Parquet that Avro feels like a shift from time to time :)

Re: Avro schema having Map of Records

Posted by Ryan Skraba <ry...@skraba.com>.
Hello -- Avro supports a map type:
https://avro.apache.org/docs/1.9.0/spec.html#Maps

Generating an Avro schema from a JSON example can be ambiguous since a
JSON object can either be converted to a record or a map.  You're
probably looking for something like this:

{
  "type" : "record",
  "name" : "MyClass",
  "namespace" : "com.acme.avro",
  "fields" : [ {
    "name" : "one_level",
    "type" : {
      "type" : "record",
      "name" : "one_level",
      "fields" : [ {
        "name" : "inner_level",
        "type" : {
          "type" : "map",
          "values" : {
            "type" : "record",
            "name" : "sample",
            "fields" : [ {
              "name" : "field1",
              "type" : "string"
            }, {
              "name" : "field2",
              "type" : {
                "type" : "array",
                "items" : "string"
              }
            } ]
          }
        }
      } ]
    }
  } ]
}
