You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2010/03/01 20:40:05 UTC

[jira] Created: (AVRO-435) Support Set containers

Support Set containers
----------------------

                 Key: AVRO-435
                 URL: https://issues.apache.org/jira/browse/AVRO-435
             Project: Avro
          Issue Type: New Feature
            Reporter: Jonathan Ellis
            Priority: Minor


Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-435) Support Set containers

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840001#action_12840001 ] 

Scott Carey commented on AVRO-435:
----------------------------------

An ordered set is easily compared to an array, perhaps they should be ordered.

If things are unordered, comparison gets more complicated.  For example, the simple way to compare would be to build both sets up in memory -- this would not work for large sets or lists.
I propose that a set is ordered by default.  A client can do this by either sorting when writing, or with a data structure like a linked set.  A client can choose to disregard order and accept that set equivalence is invalid if they wish.

We could consider an "ordered": true property as well
"unique": false, "ordered": true
Would then be the implicit default.  An array or set that is not ordered is always unequal to another array or set.  A reader with an 'ordered' schema reading an 'unordered' serialization would be a tough spot.  That might not be a supported promotion.

{quote}
That leads us to the next question: What are the resolution rules between arrays and sets? My answer to this is: sets written can always be read as arrays. Arrays written can be read as sets as long as the uniqueness constraint is not violated.{quote}

One can always read an array as a set, even with duplicates.  The duplicates get eliminated in the process of creating the set.  Interestingly, one can go either direction, but not back and forth.

I think in the short run, Doug's version is appropriate.  The above would be valuable but also take a while to sort out the details for what works best across languages and is a spec change, rather than a Java API extension.  Besides, it should be possible to specify what object to use as an array container regardless.

Taken in combination with AVRO-436, ordered maps, these two things are potentially significant changes for something that clients can emulate on their own.  Ordered maps can be emulated by an array of {key, value} tuples.
The simplest option other than Doug's Java Reflect API version is to require that Avro sets are ordered for the purposes of equivalence and comparisson, and if a client wants to compare two objects for equality or sort order they must guarantee order on writing (this restriction already happens when serializing sets as arrays).  
"unique": (true|false) is then just a reserved keyword hint for languages to construct Set - like APIs for data access.
We can consider adding the more difficult support for unordered sets/lists incrementally.

> Support Set containers
> ----------------------
>
>                 Key: AVRO-435
>                 URL: https://issues.apache.org/jira/browse/AVRO-435
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-435) Support Set containers

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840257#action_12840257 ] 

Doug Cutting commented on AVRO-435:
-----------------------------------

My preference is an attribute that's a hint to the runtime, e.g., "useSet":true might indicate that a set-like representation should be used if available, but permitting implementations to ignore the annotation and still process the data correctly.  If a writer intends setwise comparisons, it must sort the array before writing.  We could require that, if this attribute is specified, that no duplicates are present and perhaps that things are sorted, but we should not force readers to check this or otherwise handle these any differently than other arrays.

For Cassandra, what matters is that the Java specific API represent these differently than arrays.  So, in terms of a patch, that's the implementation to target.  We might actually implement it in generic too, since specific inherits its array code from there anyway.  But I don't think we need to implement it anywhere else right off, since it is an optional feature.  So SpecificCompiler should generate a different type when this attribute is specified, a GenericData.Set class should be added, extending GenericData.Array, and GenericDatumReader#newArray() should create an instance of GenericData.Set when "useSet" is specified.  Does that sound like a reasonable plan?


> Support Set containers
> ----------------------
>
>                 Key: AVRO-435
>                 URL: https://issues.apache.org/jira/browse/AVRO-435
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-435) Support Set containers

Posted by "Thiruvalluvan M. G. (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839982#action_12839982 ] 

Thiruvalluvan M. G. commented on AVRO-435:
------------------------------------------

+1 for "unique":"true".

Array differs from set in two ways:
   - Arrays guarantee the order of elements and sets don't. Using arrays to implement sets shouldn't be a problem.
   - Arrays do not guarantee uniqueness. Using arrays for sets would work if we can define _equality_ in the spec. Unfortunately we cannot say "two entities are equal if their schemas are equal and their bit representations in Avro are equal". The trouble comes from maps (and sets when we have them) because Avro does not enforce order in them. But we can define _equality_ something like this:
      - Two primitive entities are equal if and only if their schemas and Avro binary representations are equal.
      - Two complex entities (other than maps and sets) are equal if an only if their schemas are equal and their contents are equal. For example, two unions are equal if and only if their union indexes are equal and the union members are equal.
      - Two Maps(sets) are equal if and only if their schemas are equal and their elements are equal except for their order.

Another question to answer is: Is it the responsibility of the Avro library to ensure uniqueness? My answer is no. Since, in Avro we interpret the contents assuming that the the schema we have is indeed the schema that the writer used, we can trust "unique":"true" as well.

That leads us to the next question: What are the resolution rules between arrays and sets? My answer to this is: sets written can always be read as arrays. Arrays written can be read as sets as long as the uniqueness constraint is not violated. It extends our schema resolution philosophy - we try to be as lenient as possible unless we encounter data violations.


> Support Set containers
> ----------------------
>
>                 Key: AVRO-435
>                 URL: https://issues.apache.org/jira/browse/AVRO-435
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-435) Support Set containers

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839856#action_12839856 ] 

Scott Carey commented on AVRO-435:
----------------------------------


I think that this is best done at the API level with serialization unchanged.  Something like Doug's proposal should work.  

For cross-language purposes farther down the road, we might want to have some support for this in schema metadata that is not language specific.  Perhaps something like.

{"type": "array", "items": "int", "subtype": "set"}  

(or, "unique": true  ?)

Languages could optionally do something with that information.

> Support Set containers
> ----------------------
>
>                 Key: AVRO-435
>                 URL: https://issues.apache.org/jira/browse/AVRO-435
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-435) Support Set containers

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated AVRO-435:
-----------------------------------

    Component/s: spec

> Support Set containers
> ----------------------
>
>                 Key: AVRO-435
>                 URL: https://issues.apache.org/jira/browse/AVRO-435
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-435) Support Set containers

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839831#action_12839831 ] 

Doug Cutting commented on AVRO-435:
-----------------------------------

Adding a new fundamental type to Avro is difficult without breaking back-compatibility.  But a way to achieve this might be with a schema attribute.  For example, the Java reflect API uses the "java-class" attribute to indicate which class implements a collection.  Thus the schema ReflectDatumWriter infers for HashSet<String> is {"type": "array", "items": "string", "java-class": "java.util.HashSet"}.  ReflectDatumReader then instantiates this appropriately.

Might something like this work for Cassandra?


> Support Set containers
> ----------------------
>
>                 Key: AVRO-435
>                 URL: https://issues.apache.org/jira/browse/AVRO-435
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Cassandra uses Set as a return type for some methods.  It would be nice to not have to use a List as a workaround.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.