Posted to dev@avro.apache.org by Walt - DMG <wa...@dmg.org> on 2016/09/19 18:42:35 UTC

Fixed dimension for array

Greetings!

We represent the Data Mining Group <http://dmg.org/>, a 501(c)(3)
organization managing data mining standards such as the Portable Format for
Analytics (PFA) <http://dmg.org/pfa/>. PFA is used by data scientists to
transport and deploy predictive models in a standards-compliant way.

You may be interested to know that PFA, a JSON-based format, uses Avro as
its type system <http://dmg.org/pfa/docs/avro_types/>. Although this is
tangential to Avro's main goal as a serialization system, it fits well with
our need to describe structured types in JSON. (We even use Avro's schema
resolution to identify subtypes for covariant function arguments.)

In our development of PFA, we have found one kind of data structure that is
hard to express in Avro: tensors. Although we can (and do) build matrices
as {"type": "array", "items": {"type": "array", "items": "double"}}, this
type does not specify that the grid of numbers is rectangular. We believe
that rectangular arrays of numbers (or other nested types) would be a
strong addition to Avro, both as a type system and as a serialization
format. With the total size of all dimensions fixed in the schema, they
would not need to be repeated in each serialized datum.

For instance, suppose there was an extension of type "array" to specify
dimensions:

{"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}


This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann
curvature tensor <https://en.wikipedia.org/wiki/Riemann_curvature_tensor>
in 3-space) specifies that 81 double-precision numbers (3*3*3*3) are
expected for each datum. With nested arrays, the size, "3," would have to
be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for each datum, even
though they never change in a dataset of Riemann tensors. With a
"dimensions" attribute in the schema, only the content needs to be
serialized. Moreover, this extension can clearly be used with any other
"items" type, to make dense tables of strings, for instance.
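The counting above can be checked with a short Python sketch (the helper names are ours, and for simplicity it treats each nested array as carrying a single length value, rather than modeling Avro's block-based array encoding exactly):

```python
from functools import reduce

def length_prefixes(dimensions):
    """Count how many array-length values a nested-array encoding carries
    per datum: one for the outer array, then one per sub-array at each
    deeper level -- 1 + 3*(1 + 3*(1 + 3)) for dimensions [3, 3, 3, 3]."""
    count, arrays_at_level = 0, 1
    for dim in dimensions:
        count += arrays_at_level   # every array at this level stores its own length
        arrays_at_level *= dim     # its elements are the arrays of the next level
    return count

def total_elements(dimensions):
    """Element count implied by a schema with fixed dimensions (3*3*3*3 = 81)."""
    return reduce(lambda a, b: a * b, dimensions, 1)

dims = [3, 3, 3, 3]
print(length_prefixes(dims))  # 40 lengths repeated in every nested-array datum
print(total_elements(dims))   # 81 doubles; the dimensions appear once, in the schema
```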

Avro has been extended in a similar way in the past: the "fixed" type is a
"bytes" whose length is declared once in the schema rather than with each
datum. Our proposal provides a similar packing for structured objects, and
the savings can be significant for large numbers of dimensions, as shown
above. The advantage
to PFA is that we can write functions that do not need to check all array
sizes at runtime (for operations like tensor contractions and products).

We have searched the web and the Avro JIRA site for similar proposals and
found none, so we're adding this proposal to JIRA (see issue 1922
<https://issues.apache.org/jira/browse/AVRO-1922>) in addition to this
e-mail. Please let us know if you have any comments, or if we can provide
any more information.

Thank you for your consideration!
-- Walt Wells for the Data Mining Group

Re: Fixed dimension for array

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Walt,

Thanks for bringing this up, it sounds like an interesting idea.

Do you think this requires a change to the Avro definition? Adding types to
the format spec at this point is difficult because it would be a breaking
change: it would mean updating the spec and all the reader implementations,
and giving up forward compatibility (the guarantee that old readers can
read data written by newer ones).

I'm wondering if, instead, we could come up with a spec for storing tensors
and use a logical type annotation (examples at [1]). For example, you can
store a millisecond-precision timestamp in an Avro long, so adding a
timestamp doesn't require a breaking change to the spec because we just add
metadata to an existing type. A tensor could be defined like this:

  {"type": "array", "logicalType": "tensor", "dimensions": [3, 3, 3, 3], "items": "double"}

Then you can supply code to convert between a multi-dimensional tensor
object and a one-dimensional array (see the decimal implementation for an
example [2]). That would eliminate the overhead of storing arrays of
arrays, but would be compatible with older readers that see a big array and
extra metadata.
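The round trip such a conversion would perform can be sketched in a few lines of Python (this is only an illustration of row-major flattening, not the actual Avro Conversion API, which is Java and shown in [2]):

```python
def flatten(nested, dimensions):
    """Row-major flatten of a nested list with the given dimensions,
    as a tensor-to-array conversion would do before writing."""
    assert len(nested) == dimensions[0]
    if len(dimensions) == 1:
        return list(nested)
    flat = []
    for sub in nested:
        flat.extend(flatten(sub, dimensions[1:]))
    return flat

def unflatten(flat, dimensions):
    """Rebuild the nested structure from the flat array after reading."""
    if len(dimensions) == 1:
        return list(flat)
    stride = len(flat) // dimensions[0]  # elements per slice along the first axis
    return [unflatten(flat[i * stride:(i + 1) * stride], dimensions[1:])
            for i in range(dimensions[0])]

matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # a 2x3 "tensor"
flat = flatten(matrix, [2, 3])
print(flat)                               # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(unflatten(flat, [2, 3]) == matrix)  # True
```

An old reader that knows nothing about the "tensor" logical type would simply see the flat array of 6 doubles, which is the compatibility property described above.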

Do you think that would work?

rb


[1]:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java
[2]:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Conversions.java#L62-L96


On Mon, Sep 19, 2016 at 11:42 AM, Walt - DMG <wa...@dmg.org> wrote:

> [quoted text of the original proposal elided]


-- 
Ryan Blue
Software Engineer
Netflix