Posted to commits@cassandra.apache.org by "Robert Stupp (JIRA)" <ji...@apache.org> on 2015/07/06 20:06:04 UTC
[jira] [Commented] (CASSANDRA-6710) Support union types

    [ https://issues.apache.org/jira/browse/CASSANDRA-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615413#comment-14615413 ] 

Robert Stupp commented on CASSANDRA-6710:
-----------------------------------------

I’ve just prepared two proposals for a _union_ type in C*. There is no code yet - just thoughts. The proposals differ in _how_ a union is declared as a column type - i.e. with or without declaring the possible component-types up-front. Both have their own charm, but I think the downsides of the 2nd variant are too dangerous.

h3. General

* A union must occupy exactly one cell/atom (i.e. no splitting into one cell for the union's type and another cell for its value). That way there’s no need to special-case the _null_ value for a _union_ - a union _is_ null or _is not_ null - nothing like _union is present but its value is null_.
* Can be used in primary key columns (like current collection and tuple/user types can)
* Although unions can be ”emulated“ using a tuple type, using a tuple would violate the ”contract“ of a union (or _Either_ type - see description above). IMO this justifies adding a _union_ type to C*
* Must comply with CASSANDRA-6717 and should be tackled after 6717 has landed

h3. Approach 1 - predefined set of types in union

This approach declares the possible types in the union up-front. From a data-modeling point-of-view it is clear what _can_ be in that union. It should also help with mapping _union_ to _Either_ types in functional programming languages like Scala and therefore Spark.

{noformat}
CREATE TYPE bar ( a text, b int );
CREATE TABLE/TYPE foo (
  pk int PRIMARY KEY,
  my_union union<int, bigint, timeuuid, text, frozen<bar>, frozen<set<bar>>> );
{noformat}

The schema definition would contain the (ordered) list of possible _component-types_ per column in a table declared using a _union_. That way all _component-types_ are indexed and can be referenced from within any union’s value. Serialization of the union type includes an _index_ to a union’s declared component-type. By using a single byte, unions with up to 128 (0-based, signed byte) components are theoretically possible - but honestly only a handful would be relevant in practice.

The serialized format for a cell/atom would look like this:

| {{\[byte\]}} | component-index | references the n-th _component-type_ (0-based) in the declaration of the union in the column or the containing table. |
| {{\[bytes\]}} | data | serialized representation of the type - no need to handle nulls |
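
The layout above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not part of the proposal: the class {{IndexedUnionCell}}, its methods and the use of plain type names instead of real type objects are all hypothetical, and the value bytes are assumed to be already serialized by the component type.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the approach 1 cell layout: [byte component-index][bytes data].
final class IndexedUnionCell {
    // Ordered component types as declared for my_union (names only, for illustration).
    static final List<String> COMPONENT_TYPES = Arrays.asList(
            "int", "bigint", "timeuuid", "text", "frozen<bar>", "frozen<set<bar>>");

    // Prefix the already-serialized value with the index of its component type.
    static ByteBuffer encode(String cqlType, ByteBuffer serializedValue) {
        int index = COMPONENT_TYPES.indexOf(cqlType);
        if (index < 0 || index > Byte.MAX_VALUE)
            throw new IllegalArgumentException("not a component type of this union: " + cqlType);
        ByteBuffer cell = ByteBuffer.allocate(1 + serializedValue.remaining());
        cell.put((byte) index);
        cell.put(serializedValue.duplicate()); // duplicate() leaves the caller's position untouched
        cell.flip();
        return cell;
    }

    // Resolve the component type from the leading index byte.
    static String typeOf(ByteBuffer cell) {
        int index = cell.get(cell.position());
        if (index < 0 || index >= COMPONENT_TYPES.size())
            throw new IllegalArgumentException("unknown component index: " + index);
        return COMPONENT_TYPES.get(index);
    }

    // View of the value bytes that follow the index byte.
    static ByteBuffer valueOf(ByteBuffer cell) {
        ByteBuffer value = cell.duplicate();
        value.position(value.position() + 1);
        return value.slice();
    }
}
```

Because the index refers to a position in the declared list, the sketch only stays readable for old cells if that list is append-only, which matches the "add but never remove" rule for altering a union.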

*Optional*: {{ALTER TABLE foo ALTER my_union union<…>}} can *add* additional types to a union, but never remove one. Whether to implement this is less a technical question than one of _data modeling best-practices_, i.e. _if_ we should support it at all. I tend to not implement this, to be consistent with what’s possible with a tuple.

Pro:
* Just one byte overhead compared to any _raw_ type.
* Has a ”strong” reference to contained UDTs, just as a ”usual” column has. This ensures schema integrity and prevents the serialization errors described for approach 2 below.

Neutral:
* Only a predefined, but extensible, set of types can be used. Honestly, whether this is a pro or a con depends on one’s personal preference.

h3. Approach 2 - union with _any_ type

This alternative approach gives complete freedom over which types a union may contain during its whole lifetime. So it is completely contrary to what a C _union_ or an _Either_ does or should do. It also implies some major downsides wrt UDTs.

{noformat}
CREATE TYPE bar ( a text, b int );
CREATE TABLE/TYPE foo (
  pk int PRIMARY KEY,
  my_union union );
{noformat}

The serialized format for a cell/atom would look like this:

| {{\[string\]}} | type | cql3 type name
| {{\[bytes\]}} | data | serialized representation of the type - no need to handle nulls
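
For comparison, here is a minimal Java sketch of this second layout. It assumes that {{\[string\]}} is meant as in the native protocol notation (an unsigned 2-byte length followed by UTF-8 bytes); the class {{NamedUnionCell}} and its methods are hypothetical and only serve to make the per-value overhead visible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the approach 2 cell layout: [string cql3 type name][bytes data].
final class NamedUnionCell {
    // Prefix the already-serialized value with its length-prefixed CQL type name.
    static ByteBuffer encode(String cqlTypeName, ByteBuffer serializedValue) {
        byte[] name = cqlTypeName.getBytes(StandardCharsets.UTF_8);
        if (name.length > 0xFFFF)
            throw new IllegalArgumentException("type name too long for a 2-byte length");
        ByteBuffer cell = ByteBuffer.allocate(2 + name.length + serializedValue.remaining());
        cell.putShort((short) name.length);    // assumed [string] encoding: unsigned short length
        cell.put(name);                        // followed by the UTF-8 bytes of the type name
        cell.put(serializedValue.duplicate()); // duplicate() leaves the caller's position untouched
        cell.flip();
        return cell;
    }

    // Read the type name back without moving the cell's position.
    static String typeNameOf(ByteBuffer cell) {
        ByteBuffer in = cell.duplicate();
        byte[] name = new byte[in.getShort() & 0xFFFF];
        in.get(name);
        return new String(name, StandardCharsets.UTF_8);
    }

    // Bytes added to every single value by carrying the type name.
    static int overhead(String cqlTypeName) {
        return 2 + cqlTypeName.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Under these assumptions an {{int}} value carries 5 extra bytes and a {{frozen<set<bar>>}} value 18 extra bytes, compared to the single index byte of approach 1. Note also that the name alone says nothing about which definition of {{bar}} was in effect when the value was written.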

Pro
* Very flexible in _which_ types can be used.

Contra
* Huge serialization overhead since the actual type must be serialized with the value. This might be reduced by using something similar to what Java does for type signatures - i.e. using {{t}} for {{timeuuid}} and {{[foo.bar;}} for a UDT.
* UDTs are not strongly referenced. Creating a UDT, using it in a union, then dropping and recreating a UDT with the same name but a different signature would likely cause serialization exceptions.
* Fits more into the ”schema-less” area that we want people to avoid.

h3. Native Protocol

Requires changes to the native protocol: data type serialization, schema-change notifications and schema-change result messages.

h3. Java Driver

Proposal for the Java Driver (non-binding, of course - incomplete pseudo-code):

{noformat}
public interface UnionValue {
  public int getInt();
  public String getString();
  /* more primitives */
  public UDTValue getUDTValue(UserType userType);
  public TupleValue getTupleValue(TupleType tupleType);
  public <E> Set<E> getSet(Class<E> elementType);
  public <E> List<E> getList(Class<E> elementType);
  public <K,V> Map<K,V> getMap(Class<K> keyType, Class<V> valueType);
  /* low-level */
  public DataType getType();
  public ByteBuffer getRaw();

  public void setInt(int v);
  public void setString(String v);
  /* more primitives */
  public void setUDTValue(UDTValue udtValue);
  public void setTupleValue(TupleValue tupleValue);
  public <E> void setSet(Class<E> elementType, Set<E> set);
  public <E> void setList(Class<E> elementType, List<E> list);
  public <K,V> void setMap(Class<K> keyType, Class<V> valueType, Map<K, V> map);
  /* low-level */
  public void setRaw(DataType type, ByteBuffer raw);
}
{noformat}

h3. cqlsh, Python Driver

The Python Driver needs the obvious metadata, result set and statement enhancements.
_cqlsh_ must also be able to format a union value depending on its actual type - so it adds a dynamic indirection to {{cqlshlib.format_by_type}}, besides syntax/completion enhancements.


> Support union types
> -------------------
>
>                 Key: CASSANDRA-6710
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6710
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: Tupshin Harper
>            Priority: Minor
>              Labels: ponies
>             Fix For: 3.x
>
>
> I sometimes find myself wanting to abuse Cassandra datatypes when I want to interleave two different types in the same column.
> An example is in CASSANDRA-6167 where an approach is to tag what would normally be a numeric field with text indicating that it is special in some ways.
> A more elegant approach would be to be able to explicitly define disjoint unions in the style of Haskell's and Scala's Either types.


