Posted to user@avro.apache.org by Wai Yip Tung <wy...@tungwaiyip.info> on 2014/06/05 01:40:26 UTC

Union resolution in dynamic languages

When encoding data of a union type, the Avro specification does not say
much about how to choose which type in the union to use. So far I have
mostly used unions so that I can write either null or one other simple
type. In those cases it is fairly obvious how the encoder can distinguish
null from the other type.
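
To make the simple case concrete, here is a minimal sketch (not taken from
any Avro library) of encoding a value against the union ["null", "string"].
It relies on the binary encoding described in the Avro specification: the
zero-based branch index is written as a zig-zag varint, followed by the
value encoded with that branch's schema. The function names are
hypothetical.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer as Avro encodes int/long: zig-zag, then base-128 varint.

    Intended for values in the signed 64-bit range.
    """
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_nullable_string(value) -> bytes:
    """Encode a value against the union ["null", "string"]."""
    if value is None:
        # Branch 0 is null; null itself is written as zero bytes.
        return zigzag_varint(0)
    if isinstance(value, str):
        # Branch 1 is string: length as a long, then the UTF-8 bytes.
        data = value.encode("utf-8")
        return zigzag_varint(1) + zigzag_varint(len(data)) + data
    raise TypeError("value must be None or str for this union")
```

Here the Python type of the value alone decides the branch, which is why
this case needs no deeper inspection.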

However, a union can also contain named types, so its branches can be two
records, say a Manager record and a NonManager record. In strongly typed
languages, I think the suitable type in the union can be selected by
introspection. In dynamic languages, though, these might just be
represented as maps without any notion of type. In some cases we may find
that the object has all the attributes of a NonManager but not those of a
Manager, so we can conclude that NonManager is the proper schema to use.
But this can get complicated with nested data structures, where the
attribute that disambiguates things appears at a deeper level. You can
also think of valid scenarios where inspecting the content of the object
cannot unambiguously resolve the union branch.
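
The ambiguity can be illustrated with a small sketch. The record shapes
below (a required "name" plus one optional field each) are assumptions made
up for this example, and matching on key names only is a simplification of
what a real encoder would do.

```python
from typing import Dict, List, NamedTuple, Set


class RecordShape(NamedTuple):
    """Hypothetical stand-in for a record schema: just its field names."""
    required: Set[str]
    optional: Set[str]


# Hypothetical union of two record types.
UNION: Dict[str, RecordShape] = {
    "Manager": RecordShape(required={"name"}, optional={"reports"}),
    "NonManager": RecordShape(required={"name"}, optional={"manager"}),
}


def matching_branches(datum: dict, union: Dict[str, RecordShape]) -> List[str]:
    """Return every branch whose fields are compatible with the datum's keys."""
    keys = set(datum)
    return [
        name
        for name, shape in union.items()
        # All required keys present, and no key outside the known fields.
        if shape.required <= keys <= shape.required | shape.optional
    ]
```

A map with only a "name" key matches both branches, so inspection of the
content cannot decide between them.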

I notice that the Python implementation uses a two-pass recursive
validation, possibly in order to resolve the union choice.
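
As a rough illustration of that approach (a simplified sketch, not the
library's actual code), the first pass can be a recursive validation that
picks the first union branch the datum validates against; the writer then
walks the datum a second time to encode it. Schemas are written as plain
Python structures in Avro's JSON style, and only a few types are handled.

```python
def validate(schema, datum) -> bool:
    """Recursively check whether datum conforms to schema (small subset of Avro)."""
    if isinstance(schema, list):  # a union: any branch may match
        return any(validate(branch, datum) for branch in schema)
    if schema == "null":
        return datum is None
    if schema == "string":
        return isinstance(datum, str)
    if schema == "long":
        return isinstance(datum, int) and not isinstance(datum, bool)
    if isinstance(schema, dict) and schema.get("type") == "record":
        # Missing keys are read as None, so nullable fields may be absent.
        return isinstance(datum, dict) and all(
            validate(f["type"], datum.get(f["name"])) for f in schema["fields"]
        )
    return False


def resolve_union(union: list, datum) -> int:
    """Pass 1: return the index of the first branch that validates."""
    for index, branch in enumerate(union):
        if validate(branch, datum):
            return index
    raise ValueError("datum matches no branch of the union")


# Hypothetical schemas for the example in this message.
MANAGER = {"type": "record", "name": "Manager", "fields": [
    {"name": "name", "type": "string"},
    {"name": "reports", "type": ["null", "long"]},
]}
NON_MANAGER = {"type": "record", "name": "NonManager", "fields": [
    {"name": "name", "type": "string"},
    {"name": "manager", "type": ["null", "string"]},
]}
UNION = [MANAGER, NON_MANAGER]
```

In this sketch, resolve_union(UNION, {"name": "Bob", "manager": "Ann"})
returns 0 (Manager), because the absent "reports" field validates as null
and the extra "manager" key is never checked. First-match resolution can
therefore pick a branch the author did not intend.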

I wonder whether much consideration has been given to potentially
complex, indirectly nested union types that might be difficult to
resolve, and that would thus add complexity to the implementation of
encoders. Are there use cases in practice that involve complex union
decisions?

Wai Yip


Re: Union resolution in dynamic languages

Posted by Philip Zeyliger <ph...@cloudera.com>.
For what it's worth, for reasons including the tricky handling in dynamic
languages, I've taken to defining "unions" in the Thrift or Protocol
Buffer style.  Instead of "union(A,B,C,D)", I do
"struct(union(null, A) a, union(null, B) b, union(null, C) c, union(null,
D) d)".  Note that this implies certain storage inefficiencies.  I'm doing
this in RPC-land, where the extra few bytes aren't bothering me.
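
A minimal sketch of that pattern follows, with schemas as Python
dictionaries in Avro's JSON style. The helper names (one_of_schema, wrap,
unwrap) and the record name are hypothetical; the types A through D are
assumed to be defined elsewhere.

```python
from typing import Any, List, Tuple


def one_of_schema(name: str, branches: List[Tuple[str, str]]) -> dict:
    """Build a record with one nullable field per alternative type."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": field, "type": ["null", type_name], "default": None}
            for field, type_name in branches
        ],
    }


def wrap(field: str, value: Any, branches: List[Tuple[str, str]]) -> dict:
    """Return a datum with exactly one field set and the others null."""
    names = [f for f, _ in branches]
    if field not in names:
        raise KeyError(field)
    return {f: (value if f == field else None) for f in names}


def unwrap(record: dict) -> Tuple[str, Any]:
    """Return (field, value) for the single non-null field."""
    present = [(f, v) for f, v in record.items() if v is not None]
    if len(present) != 1:
        raise ValueError("expected exactly one non-null field")
    return present[0]


BRANCHES = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]
SCHEMA = one_of_schema("OneOfABCD", BRANCHES)  # record name is an assumption
```

Because the field name says which alternative is present, the encoder never
has to inspect the value to pick a branch. The cost is that each null field
still takes one byte for its union index in the binary encoding, so this
four-field wrapper spends three more bytes than a direct four-way union.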

-- Philip



Re: Union resolution in dynamic languages

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
Sure, but that is kind of an unbounded question. Can you be more specific as to what you’re looking for?

Here’s a shot at an answer:
Polymorphism is a weak spot for Avro; unions help get around that shortcoming. We have unions which contain multiple record specifications. A reference that has a union datatype in the schema could point to an instance of one of many classes, and which class it is will be known only at runtime.
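
When the values are instances of distinct classes, the runtime class can
select the union branch directly. The sketch below assumes hypothetical
Manager and NonManager classes whose names match the record names in the
union; it is an illustration, not how any particular Avro binding works.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manager:
    name: str
    reports: List[str] = field(default_factory=list)


@dataclass
class NonManager:
    name: str
    manager: str = ""


# Record names of the union branches, in schema order (assumed).
UNION_BRANCHES = ["Manager", "NonManager"]


def branch_index(obj) -> int:
    """Pick the union branch from the object's runtime class name."""
    # list.index raises ValueError if the class is not part of the union.
    return UNION_BRANCHES.index(type(obj).__name__)
```

Plain dicts carry no such class information, which is where the resolution
problem discussed in this thread comes from.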



Grant Overby
Software Engineer
Cisco.com


Re: Union resolution in dynamic languages

Posted by Wai Yip Tung <wy...@tungwaiyip.info>.
That's good to know. Would you mind sharing your use case with us?

Wai Yip


Re: Union resolution in dynamic languages

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
Disallowing multiple named types within a union would break our use cases.

We have a similar problem. With two record types in a union, the Python driver doesn’t choose well.

We solved this problem by adding a pseudo-reserved key to the dict to indicate which named type to use. I started the process of open sourcing that patch a few days ago. It’s definitely a hack, but I’m hoping the community will accept it.

Our patch doesn’t change the time complexity. From a brief glance, choosing within the union seems to typically be O(n), as the recursion short-circuits. For named types, the complexity could be O(1). Achieving O(1) for non-named types seems achievable too. How many projects are impacted by this ‘wasted’ complexity? Simpler code might be better than faster code.
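
The thread does not show the patch itself, so the following is only a
sketch of the general idea: a reserved key in the dict names the record
type, and a precomputed name-to-index map turns branch selection into a
constant-time lookup. The key name "_type" and both function names are
hypothetical.

```python
from typing import Dict

TYPE_KEY = "_type"  # hypothetical reserved key; the real patch may differ


def build_name_index(union: list) -> Dict[str, int]:
    """Map each named branch of the union to its position, once per schema."""
    return {
        branch["name"]: index
        for index, branch in enumerate(union)
        if isinstance(branch, dict) and "name" in branch
    }


def resolve_by_key(name_index: Dict[str, int], datum: dict) -> int:
    """Select the branch from the reserved key with a single dict lookup."""
    try:
        return name_index[datum[TYPE_KEY]]
    except KeyError:
        raise ValueError("datum does not name a known branch") from None


# Hypothetical union of two records (fields omitted for brevity).
UNION = [
    "null",
    {"type": "record", "name": "Manager", "fields": []},
    {"type": "record", "name": "NonManager", "fields": []},
]
NAME_INDEX = build_name_index(UNION)
```

With this, resolve_by_key(NAME_INDEX, {"_type": "NonManager", "name": "Bob"})
selects branch 2 without validating the datum against every branch first.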


Grant Overby
Software Engineer
Cisco.com


Re: Union resolution in dynamic languages

Posted by Wai Yip Tung <wy...@tungwaiyip.info>.
I am also asking about this in the context of building an optimized
encoder. For this implementation, resolution would be much simpler if we
limited unions to not support two records, similar to how the spec does
not allow two array or two map types. I wonder if this limit would break
any significant use case.
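
For reference, the specification forbids more than one schema of the same
type in a union, except for the named types record, enum and fixed, and
forbids a union directly containing another union. The sketch below checks
those rules and, behind a flag, the stricter limit proposed here. The
function name and the flag are hypothetical, and string branches are
assumed to be primitive type names.

```python
from typing import List

NAMED_TYPES = {"record", "enum", "fixed"}


def union_problems(branches: list, allow_multiple_records: bool = True) -> List[str]:
    """Return a list of rule violations for a union's branches."""
    problems = []
    seen = set()
    record_count = 0
    for branch in branches:
        if isinstance(branch, list):
            problems.append("union directly contains another union")
            continue
        kind = branch if isinstance(branch, str) else branch["type"]
        if isinstance(branch, dict) and kind in NAMED_TYPES:
            key = (kind, branch["name"])  # named types are distinct by name
        else:
            key = kind  # other types may appear only once
        if key in seen:
            problems.append(f"duplicate branch: {key}")
        seen.add(key)
        if kind == "record":
            record_count += 1
    if not allow_multiple_records and record_count > 1:
        # The stricter limit proposed in this message, not part of the spec.
        problems.append("more than one record in union")
    return problems
```

Under the spec's rules a union of Manager and NonManager records is valid;
with allow_multiple_records=False the same union is reported as a problem.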

Wai Yip