Posted to dev@avro.apache.org by "Scott Carey (JIRA)" <ji...@apache.org> on 2011/07/14 19:30:59 UTC

[jira] [Commented] (AVRO-859) Java: Data Flow Overhaul -- Composition and Symmetry

    [ https://issues.apache.org/jira/browse/AVRO-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065402#comment-13065402 ] 

Scott Carey commented on AVRO-859:
----------------------------------

h4. Functional Composition
All read and write operations can be broken into functional bits and composed rather than writing monolithic classes.  This allows a "DatumWriter2" to be a graph of functions that pre-compute all state required from a schema rather than traverse a schema for each write.  Additionally, if the functions are all of a common set of types, it becomes easy to use code generation:  either directly or by parsing the resulting function graph and converting to code that the JVM can better optimize.
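A minimal sketch of the precomputation idea (names and types here are illustrative, not Avro's actual API): the schema is walked once to build a list of per-field functions, and each subsequent write just runs the chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PrecomputedWriter {
    // Each step maps a record to the encoded form of one field (a String here,
    // standing in for real encoder output).
    static Function<int[], String> compileField(int index) {
        return record -> "field" + index + "=" + record[index];
    }

    // "Compiling" a schema of n int fields yields a function graph built once;
    // no schema traversal happens per datum after this point.
    static List<Function<int[], String>> compile(int fieldCount) {
        List<Function<int[], String>> steps = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) steps.add(compileField(i));
        return steps;
    }

    public static void main(String[] args) {
        List<Function<int[], String>> writer = compile(2); // built once per schema
        int[] datum = {7, 42};
        StringBuilder out = new StringBuilder();
        for (Function<int[], String> step : writer) out.append(step.apply(datum)).append(';');
        System.out.println(out); // field0=7;field1=42;
    }
}
```

Because every step is a plain function of a common shape, the resulting graph could also be fed to a code generator instead of being interpreted.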

h4. Symmetry
Avro's data flow can be made symmetric.  Rather than thinking in terms of Read and Write, think in terms of:
* _*Source*_: Where data that is represented by an Avro schema comes from -- this may be a Decoder, or an Object graph.
* _*Target*_: Where data that represents an Avro schema is sent -- this may be an Encoder or an Object graph.

Combine the two ideas together and you can create _*Flows*_ -- The combination of a Source and a Target for a specific Schema (or resolvable Schema pair).
The machinery that traverses and resolves schemas can be written once, and a "DatumReader" written once, with different sources and targets combined to make different tools:
* A Decoder source + GenericData target = GenericDatumReader
* A GenericData source + Encoder target = GenericDatumWriter
* BinaryDecoder source + JsonEncoder target = transform from binary to json without any intermediate objects!
* SpecificData source + GenericData target = transform one object type to another

Add in new sources and targets (Pig, ProtoBuf, Thrift objects; Pig binary, Protobuf binary, Thrift binary) and you can mix/match more transformation tasks.
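The mix-and-match idea could look roughly like this (a hypothetical sketch; `Source`, `Target`, and `pump` are illustrative names, not Avro's actual API): one generic driver, written once, works for every source/target pairing.

```java
import java.util.ArrayList;
import java.util.List;

public class MixAndMatch {
    interface Source<T> { T next(); }
    interface Target<T> { void put(T value); }

    // The generic machinery, written once; neither side knows the other.
    static <T> void pump(Source<T> src, Target<T> dst, int n) {
        for (int i = 0; i < n; i++) dst.put(src.next());
    }

    public static void main(String[] args) {
        List<Integer> binary = List.of(1, 2, 3);  // stands in for a Decoder's input
        Source<Integer> decoder = new Source<>() {
            int i = 0;
            public Integer next() { return binary.get(i++); }
        };
        List<Integer> record = new ArrayList<>(); // stands in for an object graph
        pump(decoder, record::add, 3);            // the "DatumReader" combination
        System.out.println(record); // [1, 2, 3]
    }
}
```

Swapping `record::add` for a different `Target` (another encoder, a comparator, or several targets in a tee) changes the tool without touching `pump`.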

Additionally, one can write a generic Equals/Compare implementation that takes two Sources and compares them or checks them for equality.  Then, you can compare binary data with an object, or two objects with each other.
Data flow could also tee:  one source with many targets.



h4. Functional units
After much prototyping and design, I have identified that all Avro data flow can be done by composing two functors:
The Unary Functor, which I have named *Access*: 
{code}
interface Access<A, B> {
  B access(A a);
}
{code}
And a Binary Functor with two types named *Flow*:
{code}
interface Flow<A, B> {
  B flow(A a, B b);
}
{code}
In most cases, you can replace "A" with "FROM" and "B" with "TO" in relation to the Source and Target concepts.  These functions compose naturally in all the ways required for data to flow from a source to a target.
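The two functors compile as ordinary Java interfaces. In this sketch the sample record and target types (`int[]`, `StringBuilder`) are illustrative stand-ins, not Avro classes:

```java
public class Functors {
    interface Access<A, B> { B access(A a); }
    interface Flow<A, B>   { B flow(A a, B b); }

    public static void main(String[] args) {
        // Source-side unary functor: pull field 0 out of a record.
        Access<int[], Integer> getField0 = rec -> rec[0];
        // Target-side binary functor: push an int into a target, returning the target.
        Flow<Integer, StringBuilder> writeInt = (i, sb) -> sb.append(i);

        StringBuilder target = new StringBuilder();
        writeInt.flow(getField0.access(new int[]{42}), target);
        System.out.println(target); // 42
    }
}
```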

h4. Making Symmetry
Consider this simple example, a Flow over the schema: 
{code}
{"type": "record", "name": "Foo", "fields":
  [{"name": "f1", "type": "int"}]
}
{code}

In the current implementation, a GenericDatumReader has the following API:
{code}
D read(D reuse, Decoder in);
{code}
which internally parses a Schema step by step, recursively calling methods with a similar signature.
When we get to the leaf field, we return an integer, and on return insert that into a GenericData.Record as the first field.
A very similar process occurs with GenericDatumWriter:
{code}
void write(D datum, Encoder out);
{code}
Which traverses a schema, recursively calling methods with a similar signature.
On the way down the schema graph, we access objects and pass portions of the data through, and when we hit the leaf field, we write it to the encoder and return.

Consider the innermost operation for both of the above:
Fetch an integer, then put it somewhere:
|| step || Source || Target || Source op || Target op || signature ||
| read an integer | IndexedRecord | Encoder | IndexedRecord.get() | (null) | int access(IndexedRecord) |
| read an integer | Decoder | IndexedRecord | Decoder.readInt() | (null) | int access(Decoder) |
| send integer to output | IndexedRecord | Encoder | (null) | Encoder.writeInt() | Encoder flow(int, Encoder) |
| send integer to output | Decoder | IndexedRecord | (null) | IndexedRecord.put() | IndexedRecord flow(int, IndexedRecord) |

The access and flow signatures compose as follows:
{code}
int access(A);
 FollowedBy
B flow(int, B);
 Equals
B flow(A, B);
{code}
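That composition rule can be written as one generic helper: an `Access<A, M>` followed by a `Flow<M, B>` yields a `Flow<A, B>`. A hypothetical sketch, not Avro's actual code; the sample types are stand-ins:

```java
public class Compose {
    interface Access<A, B> { B access(A a); }
    interface Flow<A, B>   { B flow(A a, B b); }

    // M access(A)  FollowedBy  B flow(M, B)  Equals  B flow(A, B)
    static <A, M, B> Flow<A, B> followedBy(Access<A, M> src, Flow<M, B> dst) {
        return (a, b) -> dst.flow(src.access(a), b);
    }

    public static void main(String[] args) {
        Access<int[], Integer> getField0 = rec -> rec[0];                 // source op
        Flow<Integer, StringBuilder> writeInt = (i, sb) -> sb.append(i);  // target op
        Flow<int[], StringBuilder> intFlow = followedBy(getField0, writeInt);
        System.out.println(intFlow.flow(new int[]{42}, new StringBuilder())); // 42
    }
}
```

Note that `followedBy` knows nothing about records or encoders; it only requires the two sides to agree on the intermediate type `M`.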

So the above two examples compose to:
|| step || Source || Target || Source op || Target op || flow signature ||
| int flow | IndexedRecord | Encoder | IndexedRecord.get() | Encoder.writeInt() | Encoder flow(IndexedRecord, Encoder) |
| int flow | Decoder | IndexedRecord | Decoder.readInt() | IndexedRecord.put() | IndexedRecord flow(Decoder, IndexedRecord) |

As can be seen, for an integer field one can compose two functions -- one provided by the Source and one provided by the Target -- and produce a Flow of data between them.
The source and target each have their own contexts -- the object types that an integer field lives in -- but do not have to know anything about the other side.  The flow composition also needs no information about the source or target -- they meet only at "int".
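Both rows of the table above fall out of the same composition machinery. In this sketch the "Decoder" and "Encoder" are stand-ins (an iterator and a StringBuilder), not Avro's real classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Symmetric {
    interface Access<A, B> { B access(A a); }
    interface Flow<A, B>   { B flow(A a, B b); }

    static <A, M, B> Flow<A, B> compose(Access<A, M> src, Flow<M, B> dst) {
        return (a, b) -> dst.flow(src.access(a), b);
    }

    public static void main(String[] args) {
        // Write path: IndexedRecord.get() composed with Encoder.writeInt().
        Access<List<Integer>, Integer> recordGet = rec -> rec.get(0);
        Flow<Integer, StringBuilder> encoderWriteInt = (i, enc) -> enc.append(i);
        Flow<List<Integer>, StringBuilder> writer = compose(recordGet, encoderWriteInt);
        System.out.println(writer.flow(List.of(42), new StringBuilder())); // 42

        // Read path: Decoder.readInt() composed with IndexedRecord.put().
        Access<Iterator<Integer>, Integer> decoderReadInt = Iterator::next;
        Flow<Integer, List<Integer>> recordPut = (i, rec) -> { rec.add(i); return rec; };
        Flow<Iterator<Integer>, List<Integer>> reader = compose(decoderReadInt, recordPut);
        System.out.println(reader.flow(List.of(7).iterator(), new ArrayList<>())); // [7]
    }
}
```

The only asymmetry left is which side supplies the `Access` and which supplies the `Flow`; `compose` itself is direction-agnostic.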

> Java: Data Flow Overhaul -- Composition and Symmetry
> ----------------------------------------------------
>
>                 Key: AVRO-859
>                 URL: https://issues.apache.org/jira/browse/AVRO-859
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>
> Data flow in Avro is currently broken into two parts:  Read and Write.  These share many common patterns but almost no common code.  
> Additionally, the APIs for this are DatumReader and DatumWriter, which requires that implementations know how to traverse Schemas and use the Resolver.
> This is a proposal to overhaul the inner workings of Avro Java between the Decoder/Encoder APIs and DatumReader/DatumWriter such that there is significantly more code re-use and much greater opportunity for new features that can all share in general optimizations and dynamic code generation.
> The two primary concepts involved are:
> * _*Functional Composition*_
> * _*Symmetry*_
> h4. Functional Composition
> All read and write operations can be broken into functional bits and composed rather than writing monolithic classes.  This allows a "DatumWriter2" to be a graph of functions that pre-compute all state required from a schema rather than traverse a schema for each write.
> h4. Symmetry
> Avro's data flow can be made symmetric.  Rather than thinking in terms of Read and Write, think in terms of:
> * _*Source*_: Where data that is represented by an Avro schema comes from -- this may be a Decoder, or an Object graph.
> * _*Target*_: Where data that represents an Avro schema is sent -- this may be an Encoder or an Object graph.
> (More detail in the comments)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira