Posted to dev@avro.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2016/10/03 16:17:36 UTC

Re: Fixed dimension for array

Hi Walt,

Thanks for bringing this up, it sounds like an interesting idea.

Do you think this requires a change to the Avro definition? It's difficult
to change the format spec and add types at this point because it would be a
breaking change to the format. That would mean updating the spec and every
reader implementation, and giving up forward compatibility -- the property
that old readers can read data written by newer ones.

I'm wondering if, instead, we could come up with a spec for storing tensors
and use a logical type annotation (examples at [1]). For example, you can
store a millisecond-precision timestamp in an Avro long, so adding a
timestamp doesn't require a breaking change to the spec because we just add
metadata to an existing type. A tensor could be defined like this:

  { "type": "array", "logicalType": "tensor", "dimensions": [3, 3, 3, 3],
    "items": "double" }

Then you can supply code to convert between a multi-dimensional tensor
object and a one-dimensional array (see the decimal implementation for an
example [2]). That would eliminate the overhead of storing arrays of arrays
while remaining compatible with older readers, which would simply see one
big array plus some extra metadata.
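To make the idea concrete, here is a minimal sketch of the conversion such a
logical type would need -- flattening a rectangular nested structure into the
one-dimensional array that actually gets serialized, and rebuilding it on
read. This is illustrative Python, not Avro's conversion API (which is Java);
the function names and error handling are my own assumptions:

```python
# Sketch of the tensor <-> flat array conversion a "tensor" logical type
# would perform. Names and error handling are illustrative assumptions,
# not part of any Avro API.
from functools import reduce
from operator import mul

def flatten(tensor, dims):
    """Flatten nested lists of the given shape into one flat list,
    verifying that every level is rectangular."""
    if not dims:
        return [tensor]  # a scalar leaf
    if len(tensor) != dims[0]:
        raise ValueError("expected %d elements, got %d" % (dims[0], len(tensor)))
    flat = []
    for sub in tensor:
        flat.extend(flatten(sub, dims[1:]))
    return flat

def unflatten(flat, dims):
    """Rebuild the nested structure from a flat list of the right length."""
    if not dims:
        return flat[0]
    if len(flat) != reduce(mul, dims, 1):
        raise ValueError("flat length does not match dimensions")
    step = len(flat) // dims[0]
    return [unflatten(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]
```

On write, the conversion would check the shape and emit the flat array; on
read, old readers see a plain array while tensor-aware readers rebuild the
nested structure from the "dimensions" metadata.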

Do you think that would work?

rb


[1]:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java
[2]:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Conversions.java#L62-L96


On Mon, Sep 19, 2016 at 11:42 AM, Walt - DMG <wa...@dmg.org> wrote:

> Greetings!
>
> We represent the Data Mining Group <http://dmg.org/>, a 501(c)3
> organization managing data mining standards such as the Portable Format for
> Analytics (PFA) <http://dmg.org/pfa/>. PFA is used by data scientists to
> transport and deploy predictive models in a standards-compliant way.
>
> You may be interested to know that PFA, a JSON-based format, uses Avro as
> its type system <http://dmg.org/pfa/docs/avro_types/>. Although this is
> tangential to Avro's main goal as a serialization system, it fits well with
> our need to describe structured types in JSON. (We even use Avro's schema
> resolution to identify subtypes for covariant function arguments.)
>
> In our development of PFA, we have found one kind of data structure that is
> hard to express in Avro: tensors. Although we can (and do) build matrices
> as {"type": "array", "items": {"type": "array", "items": "double"}}, this
> type does not specify that the grid of numbers is rectangular. We believe
> that rectangular arrays of numbers (or other nested types) would be a
> strong addition to Avro, both as a type system and as a serialization
> format. With the sizes of all dimensions fixed in the schema, they would
> not need to be repeated in each serialized datum.
>
> For instance, suppose there was an extension of type "array" to specify
> dimensions:
>
> {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}
>
>
> This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann
> curvature tensor <https://en.wikipedia.org/wiki/Riemann_curvature_tensor>
> in 3-space) specifies that 81 double-precision numbers (3*3*3*3) are
> expected for each datum. With nested arrays, the size, "3," would have to
> be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for each datum, even
> though they never change in a dataset of Riemann tensors. With a
> "dimensions" attribute in the schema, only the content needs to be
> serialized. Moreover, this extension can clearly be used with any other
> "items" type, to make dense tables of strings, for instance.
>
> Avro has been extended in a similar way in the past. The "fixed" type is a
> "bytes" without the need to specify the number of bytes for each datum. Our
> proposal provides a similar packing for structured objects, with savings that
> can be significant for large numbers of dimensions, as shown above. The advantage
> to PFA is that we can write functions that do not need to check all array
> sizes at runtime (for operations like tensor contractions and products).
>
> We have searched the web and the Avro JIRA site for similar proposals and
> found none, so we're adding this proposal to JIRA (see issue 1922
> <https://issues.apache.org/jira/browse/AVRO-1922>) in addition to this
> e-mail. Please let us know if you have any comments, or if we can provide
> any more information.
>
> Thank you for your consideration!
> -- Walt Wells for the Data Mining Group
>
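The "40 times" figure in the quoted proposal is easy to check: a nested-array
encoding writes one length for every array at every nesting level. The sketch
below counts those lengths; note it is a simplification, since Avro's actual
array encoding writes blocks with a count plus a terminating zero, so the
real per-datum overhead is even larger:

```python
# Count the array lengths a nested-array encoding repeats per datum.
# A schema-level "dimensions" attribute would reduce this to zero.
# Simplified: Avro's real encoding adds a terminating zero per array too.
def nested_length_prefixes(dims):
    total, arrays = 0, 1
    for size in dims:
        total += arrays   # each array at this level writes its length
        arrays *= size    # number of arrays at the next level down
    return total
```

For dimensions [3, 3, 3, 3] this gives 1 + 3 + 9 + 27 = 40 lengths per datum,
matching Walt's 1 + 3*(1 + 3*(1 + 3)) arithmetic.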



-- 
Ryan Blue
Software Engineer
Netflix