Posted to user@avro.apache.org by Ran S <ra...@liveperson.com> on 2013/05/23 16:15:23 UTC

using Avro unions with HIVE

Hi,
We started to work with Avro in CDH4 and to query the Avro files using Hive.
This does work fine for us, except for unions.
We do not understand how to query the data inside a union using Hive.

For example, let's look at the following schema:

{
    "type":"record",
    "name":"event",
    "namespace":"com.mysite",
    "fields":[
        {
            "name":"header",
            "type":{
                "type":"record", "name":"CommonHeader",
                "fields":[
                    { "name":"eventTimeStamp", "type":"long", "default":-1 },
                    { "name":"globalUserId", "type":["null", "string"], "default":null }
                ]
            },
            "default":null
        },
        {
            "name":"eventbody",
            "type":{
                "type":"record", "name":"eventbody",
                "fields":[
                    {
                        "name":"body",
                        "type":[
                            "null",
                            {
                                "type":"record",
                                "name":"event1",
                                "fields":[
                                    {
                                        "name":"event1Header",
                                        "type":["null", { "type":"array", "items":"string" }],
                                        "default":null
                                    },
                                    {
                                        "name":"event1Body",
                                        "type":["null", { "type":"array", "items":"string" }],
                                        "default":null
                                    }
                                ]
                            },
                            {
                                "type":"record",
                                "name":"event2",
                                "fields":[
                                    {
                                        "name":"page",
                                        "type":{
                                            "type":"record", "name":"URL",
                                            "fields":[{ "name":"url", "type":"string" }]
                                        },
                                        "default":null
                                    },
                                    {
                                        "name":"referrer", "type":"string",
                                        "default":null
                                    }
                                ]
                            }
                        ],
                        "default":null
                    }
                ]
            },
            "default":null
        }
    ]
}

Note that "body" is a union of three types:
null, "event1" and "event2"

So if I want to query fields inside event1, I first need to access it.
I would then write a HiveQL query like this:
SELECT eventbody.body.??? from SRC

My question is: what should I put in place of the ??? above to make this work?

Thank you,
Ran



--
View this message in context: http://apache-avro.679487.n3.nabble.com/using-Avro-unions-with-HIVE-tp4027473.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: using Avro unions with HIVE

Posted by Ran S <ra...@liveperson.com>.
Thank you Mark.
I will try to follow up in HIVE user-group on the ability to read data from
uniontype.

Ran



--
View this message in context: http://apache-avro.679487.n3.nabble.com/using-Avro-unions-with-HIVE-tp4027473p4027496.html

Re: using Avro unions with HIVE

Posted by Mark Wagner <wa...@gmail.com>.
Hi Ran,

Unfortunately, there's no real way to manipulate unions in Hive. The Avro
SerDe translates Avro unions into Hive unions correctly, but the support
for accessing those fields is not there. The exception to this is the
[null, T] pattern for nullable fields, which is handled by the Avro SerDe
transparently. This JIRA is tracking improved union support for Hive, but
it's not being actively worked on:
https://issues.apache.org/jira/browse/HIVE-2390.
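
As a rough illustration of the distinction above (this is not the Avro
SerDe's actual code), the Python sketch below separates the [null, T]
pattern from general unions. The names branch_name and classify_union are
hypothetical, and accepting "null" in either position is an assumption
made for this example.

```python
def branch_name(schema):
    """Return a short label for one union branch (hypothetical helper)."""
    if isinstance(schema, str):
        return schema  # primitive or previously defined named type
    # Records carry a "name"; arrays and maps only carry a "type".
    return schema.get("name", schema["type"])


def classify_union(branches):
    """Classify an Avro union given as a list of branch schemas.

    Returns ("nullable", [T]) for the two-branch null/T pattern and
    ("general", [names...]) for anything else.
    Assumption: "null" may appear in either position.
    """
    non_null = [branch_name(b) for b in branches if b != "null"]
    if len(branches) == 2 and len(non_null) == 1:
        return ("nullable", non_null)
    return ("general", non_null)


# The "body" union from the schema in the original question, with the
# record fields omitted for brevity.
body_union = [
    "null",
    {"type": "record", "name": "event1", "fields": []},
    {"type": "record", "name": "event2", "fields": []},
]

print(classify_union(["null", "string"]))  # globalUserId: nullable field
print(classify_union(body_union))          # body: general union
```

Under these assumptions, globalUserId falls into the nullable case, while
"body" is a general union with the two non-null branches event1 and event2.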

Thanks,
Mark


On Thu, May 23, 2013 at 11:45 AM, Scott Carey <sc...@apache.org> wrote:

> The Hive mailing list would have more info on the Avro SerDe usage.
>
> In general, a system that does not have union types like Hive (or Pig,
> etc.) has to expand a union into multiple fields if there is more than one
> non-null type -- and at most one branch of the union is not null.
>
> For example a record with fields:
>
>   {"name":"timestamp", "type":"long", "default":-1}
>   {"name":"ipAddress", "type":["IPv4", "IPv6"]}
>
> where IPv4 and IPv6 are previously defined types, would have to expand to
> three fields
>  "timestamp", "ipAddress:IPv4", and "ipAddress:IPv6", where only one of
> the last two is not null in any given record.
>
> I do not know what Hive's Avro SerDe does with unions.

Re: using Avro unions with HIVE

Posted by Scott Carey <sc...@apache.org>.
The Hive mailing list would have more info on the Avro SerDe usage.

In general, a system that does not have union types like Hive (or Pig,
etc.) has to expand a union into multiple fields if there is more than one
non-null type -- and at most one branch of the union is not null.

For example a record with fields:

  {"name":"timestamp", "type":"long", "default":-1}
  {"name":"ipAddress", "type":["IPv4", "IPv6"]}

where IPv4 and IPv6 are previously defined types, would have to expand to
three fields
 "timestamp", "ipAddress:IPv4", and "ipAddress:IPv6", where only one of
the last two is not null in any given record.
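
The expansion described above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the helper names
(branch_name, expand_field, expand_value) and the "field:branch" column
naming are hypothetical and do not describe what any particular SerDe does.

```python
def branch_name(schema):
    """Return a short label for one union branch (hypothetical helper)."""
    if isinstance(schema, str):
        return schema
    return schema.get("name", schema["type"])


def expand_field(field):
    """Return the flat column names for one Avro field definition."""
    ftype = field["type"]
    if not isinstance(ftype, list):
        return [field["name"]]          # not a union: one column
    non_null = [b for b in ftype if b != "null"]
    if len(non_null) <= 1:
        return [field["name"]]          # nullable field: still one column
    # More than one non-null branch: one column per branch.
    return [f'{field["name"]}:{branch_name(b)}' for b in non_null]


def expand_value(field, branch, value):
    """Map one record's value onto the expanded columns.

    `branch` is the name of the union branch the value belongs to;
    every other expanded column is set to None.
    """
    columns = expand_field(field)
    if len(columns) == 1:
        return {columns[0]: value}
    target = f'{field["name"]}:{branch}'
    return {c: (value if c == target else None) for c in columns}


fields = [
    {"name": "timestamp", "type": "long", "default": -1},
    {"name": "ipAddress", "type": ["IPv4", "IPv6"]},
]

columns = [c for f in fields for c in expand_field(f)]
print(columns)
print(expand_value(fields[1], "IPv6", "::1"))
```

For the two example fields this yields the three columns named above, and
for any single record only one of the two ipAddress columns is non-null.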

I do not know what Hive's Avro SerDe does with unions.
