You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2009/12/11 01:55:18 UTC

[jira] Created: (AVRO-251) add schema for schemas

add schema for schemas
----------------------

                 Key: AVRO-251
                 URL: https://issues.apache.org/jira/browse/AVRO-251
             Project: Avro
          Issue Type: New Feature
          Components: java
            Reporter: Doug Cutting
            Assignee: Doug Cutting
             Fix For: 1.3.0


A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790333#action_12790333 ] 

Doug Cutting commented on AVRO-251:
-----------------------------------

> I think the usecase for it (I know you have one in mind, and we're hinting at it in this JIRA)

I think you're referring to AVRO-160.  That's indeed what instigated this last week, but over the weekend I've since had second thoughts about actually using it there.  I still believe this is useful for some applications, but, if folks agree with me about AVRO-160, then committing this is not urgent.

> one alternative is to ditch the whole binary representation and store the original schema in Avro-encoded binary JSON.

I like that idea.  It'd be bigger than with the schema here, since all of the Avro keywords will be included, but it will still be considerably smaller and faster than textual JSON.  Plus the specification for JSON is much less likely to change, so its schema is likely to be much more stable and hence its less risky to assume that schema as a system constant.

> I think you would agree that using either the specific (my preference) API or the generic API would be clearer from a code perspective.

Perhaps a bit, but not much.  It adds an intermediate representation, which has some cognitive overhead, which this code does not.  This code instead requires some understanding of Avro's encoder/decoder API.  I don't think that would reduce the code size by more than perhaps 10%, and I don't think it would be much more robust.  Efficiently mapping the union branch classes to Schema subclasses would require something like a Map<Class,Schema.Type>.  This table could be built by processing the schema, rather than as this patch does by assuming that the Schema.Type enum is sync'd with the union.  But we could change this patch to build that mapping from the schema too if we are particularly concerned about that.

I actually generated the specific code first, and considered writing it that way, but it felt like more work to me.

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-251:
------------------------------

    Status: Patch Available  (was: Open)

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790298#action_12790298 ] 

Philip Zeyliger commented on AVRO-251:
--------------------------------------

Does the serialization and deserialization to binary schemas belong in Schema.java or does it belong in a nearby class?  I think the usecase for it (I know you have one in mind, and we're hinting at it in this JIRA) ought to be spelled out in the JavaDoc for the appropriate methods.

bq. Note that this currently does not preserve every nuance, e.g., user properties. So my vote is to remove default values as well.

If you're not preserving user properties, I'm +1 for killing the defaults.  This leaves us in a place where we have representations of schemas that, without other representations, we can't read data with.  (The way I think of it, we always need two schemas: the schema the data was written with, and the schema the data is being read with.  We can use the binary version for the former, but not the latter.  Is that right?  Do we have names for these two schemas?)

If you were inclined towards keeping the defaults, I would keep pushing for storing them as avro-encoded binary bytes.

bq. It's nice to see how little code is required to incorporate full JSON data into Avro.

Yes, that JSON itself has a small schema is re-assuring.  I'm +1 for taking this out of this patch, but separately producing a tool to represent "binary JSON" in Avro.

Just to be sure we've thought of it, one alternative is to ditch the whole binary representation and store the original schema in Avro-encoded binary JSON.  I actually prefer schemas to be typed.

bq. This "event based" programming style requires only a bit more coding than wrapper classes, but saves a level of redirection and/or copies.

I appreciate that with ValidatingEncoder we get a sense of security.  But I have a hard time buying the performance argument here.  I think you would agree that using either the specific (my preference) API or the generic API would be clearer from a code perspective.  If the performance of the specific API is crap, then we need to measure it and fix it: after all, that is the API Avro recommends people to use.  Considering that set of schemas in a program should have small cardinality, and the binary representation could be cached, speed doesn't seem paramount here.

I agree that event-based models are very useful for things that, say, don't fit into memory readily.  Schemas pretty much have to fit into memory readily, so I don't think the case applies here.
  

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789825#action_12789825 ] 

Philip Zeyliger commented on AVRO-251:
--------------------------------------

I haven't had a chance to fully grok the changes to Schema.java yet (particularly the writeSchema() code), but here's what I've been reviewing so far...


bq. I wonder if we even need to serialize default values.

It depends on what the goal of a binary schema is.  If we want to preserve the entire nuance of the writing schema, we should keep the defaults.  In protocol buffers, it's sometimes the case that the default values aren't materialized, because a missing value gets read as the default (this assumes same read and write schema).  That's not the case here: it's actually valid to evolve a schema by changing the default value (this is not really valid in PBs).

bq. Schema.Field.equals() to use string comparison of default values

This deserves a quick comment in the code.  It's nuanced enough that someone might trip up over it later.

bq. names.space(savedSpace);                  // restore space

I suspect this was a bug before (not popping back the previous space).  I didn't see a test for it.

bq. avro/Schema.m4

I should finish off AVRO-152 so there could be comments in here.

One thing I noticed is that you don't store "other" fields.  For example, in AVRO-152 I introduce adding "doc" to records and protocols.  And, of course, you've been using the ability to write arbitrary extra fields to implement the reflection API.  I could see either storing or not storing extra stuff; I think it's quite similar to the question of whether or not to store defaults.

Do you think it would be useful to have a hook at the schema type to have a way to evolve it?  Should it be a union?

bq. src/schemata/org/json/Json.avsc: {"type": "record", "name": "Field",

I think you should call this "object" to follow the JSON terminology.  See http://www.json.org/fatfree.html .

I'm also not sure about putting this into "org.json".  "org.apache.avro.json", perhaps?

bq.  //System.out.println(new String(data, "UTF-8"));

Ditch it?



> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835392#action_12835392 ] 

Scott Carey commented on AVRO-251:
----------------------------------

Yeah, in the file format case that is simpler.

In the general case of serializing a schema compactly, compression may be an option.  Schema could have a constructor or factory that takes a byte[] and a corresponding serializer method that returns a the schema as a set of compressed bytes.

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-251:
------------------------------

    Fix Version/s:     (was: 1.3.0)

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790283#action_12790283 ] 

Doug Cutting commented on AVRO-251:
-----------------------------------

Thiru> it won't match the standard JSON version of the schema [ ... ]

Yes, but I don't expect folks to use this for anything but binary (which I should better document).  This schema representation is really only useful when schemas are written out frequently, and should then be written more efficiently than they can be with JSON.  It's not aesthetically ideal, but still useful, and I fear forcing the JSON encodings to correspond would be impractical for reasons you indicate.

Thiru> Since the "order" field of the record is optional, it should be a union of null and Enum in the schema.

It's not optional.  It has a default value in the Java API, but it's always specified for every field.  If it were such a union then its binary encoding would be larger.

Philip> If we want to preserve the entire nuance of the writing schema, we should keep the defaults.

Note that this currently does not preserve every nuance, e.g., user properties.  So my vote is to remove default values as well.

A container that stores lots of binary schemas would be wise to include one copy of the JSON schema-for-schemas, so that, if that meta-schema changes, the data can still be processed.  This representation should contain only what's required to decode data, and its documentation should note that.  It should only contain things that are required of "actual" schemas, and applications can interpret with an "expected" schema that contains default values, user properties, etc.

Philip> I suspect this was a bug before (not popping back the previous space). I didn't see a test for it.

Good point.  Yes, it is a pre-existing bug triggered here by the inclusion of the Json schema from a different namespace in the Schema schema, so this patch does add a test, but not an explicit one.  I've filed AVRO-255 for this.  (This schema also identified AVRO-256.)

Philip> Do you think it would be useful to have a hook at the schema type to have a way to evolve it? Should it be a union?

It is in fact a union, wrapped in a record to give it a name.  But, no, I think the best way to evolve it is for containers to keep a copy of the meta schema, as with any other and build on Avro's evolution mechanisms.  Worst case, if an application neglects to store a copy and this schema changes in a subsequent release, the application can always go back to the prior release and retrieve it and use it as the "actual" schema while using the new release's version as "expected".

Philip> I think you should call this "object" to follow the JSON terminology.

It's not a JSON object but a "key value pair" within an object.  So perhaps we should call it KeyValuePair?  Note though that if we drop default values we don't need this schema here anyway.  This code could however still be useful for a command-line tool that takes arbitrary Json input and encodes it in an Avro data file, and then for MapReduce programs that process Json-format generic data.  It's nice to see how little code is required to incorporate full JSON data into Avro.

Philip> From a maintainability perspective, it scares me that Type.values() has to be kept in sync with the order of the types in the union in Schema.m4

There are two things that protect us here:
 - We have unit tests that attempt to exhaustively test this encoding/decoding code.  So it's in that sense its as safe as our DatumReader/DatumWriter implementations.  But if this were our only protection, it would not be a programming style we'd recommend for applications.  Except,
 - We have ValidatingEncoder and ValidatingDecoder.  Folks who use this programming style who do not develop exhaustive test suites should be strongly advised to only use it with ValidatingEncoder and ValidatingDecoder.  When I initially wrote this code there were a few errors (I forgot to read/write null values) that were caught by the validator.

This "event based" programming style requires only a bit more coding than wrapper classes, but saves a level of redirection and/or copies.  It also works well for applications whose serialized data does not correspond directly to in-memory structures (e.g., a streaming application with array values too large to fit in memory).  The validators make it practical.

Philip> It seems that instead of string, it could be a record (that contains a string) called "reference"

That's a good idea.  Using the event-style, the code need not actually change, nor would the serialized representation, only the schema.


> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Thiruvalluvan M. G. (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789741#action_12789741 ] 

Thiruvalluvan M. G. commented on AVRO-251:
------------------------------------------

It's unfortunate that if you serialize an Avro schema using JSON encoder and schema for schema, it won't match the standard JSON version of the schema. I don't think it'll be possible to fix the JSON encoder to match the standard JSON version. But we can do the other way around. Though it's desirable to have this property, it'll be cause incompatibility with the existing schemas and I think the schema will become harder for humans to read.

Since the "order" field of the record is optional, it should be a union of null and Enum in the schema.


> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-251:
------------------------------

    Attachment: AVRO-251.patch

Here's a new, complete version of this.  It directly encodes and decodes schemas.

JsonNode#equals() doesn't seem to do what I expect, so I had to modify Schema.Field.equals() to use string comparison of default values.  I wonder if we even need to serialize default values.  The typical application of this code is to efficiently store the schema used to write data with that data.  In this case, the default value is never used and just adds baggage.  (Default values are only used when reading data written by a schema that lacks that value.)  This patch would also be simplified somewhat if we didn't need to read/write arbitrary json for default values, although it might be a shame to discard this json-in-avro implementation.


> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835132#action_12835132 ] 

Scott Carey commented on AVRO-251:
----------------------------------

I am using DataFileReader/Writer and the header is about 5K in size because the whole schema is in text.

I'm not sure if the approach in this ticket is best for the file format, but some way to persist a schema in a compact form would be useful.  A binary format would be smaller, but every field and type would still have to be there in text.  Maybe, for the data file we could just store the schema as the string, deflate compressed.  That might be computationally more expensive for a compact schema representation, but it could be clean in general -- if the first character in a byte[] that represents a schema is a special marker value (that is invalid in JSON), then the remaining bytes are compressed json, otherwise its utf-8 json.

My largest schema is 6.3k as a string including whitespace 'pretty printed', and 4.9k without whitespace as printed by Schema.toString().
It is 1.3k compressed by gzip -5 or higher,  and 1.5k by gzip -1.

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-251:
------------------------------

    Attachment: AVRO-251.patch

Here's an early version of this.  It defines org.apache.avro.meta.SchemaDef.  We still need a method that converts this to an org.apache.avro.Schema, which should not be hard to write.

Alternately, we could rely on ValidatingEncoder and ValidatingDecoder, and add methods directly to Schema.java to read/write a schema in binary according to this schema declaration.

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789949#action_12789949 ] 

Philip Zeyliger commented on AVRO-251:
--------------------------------------

More thoughts:

My main reservation here is using the Decoder/Encoder API (instead of a higher-level one) for decoding and encoding the binary form.

If you're curious, Protocol Buffer's version of the PB that describes PBs is at http://code.google.com/p/protobuf/source/browse/trunk/src/google/protobuf/descriptor.proto

bq. Default values

Thinking about *default values* once more, I think you could encode them in binary.  Store them in a "bytes" type, and force users to decode them according to the schema that the default value corresponds to.  More compact, and no JSON conceptual dependency.  (This pattern is generally sometimes useful: "dynamic schemas" within an avro file.

bq. public abstract void writeIndex(int unionIndex)

Is Encoder.java a public api?  Is it too late to rename writeIndex to writeUnionTag?  Had to scratch my head there until I looked at the code.

bq. private static final int NAME_INDEX = Type.values().length;

>From a maintainability perspective, it scares me that Type.values() has to be kept in sync with the order of the types in the union in Schema.m4.  The "string" that corresponds to NAME_INDEX in Schema.m4 deserves a comment; it's a pretty different category than the others.  It seems that instead of string, it could be a record (that contains a string) called "reference", or just use a comment.

bq. writeSchema...

This method uses the Encoder API to write an AVRO record.  Is there a reason not to use the Specific or Generic APIs instead?

I think the same thing about readSchema().

bq. readJson()

Is this method decoding AVRO's json schema into JSON objects?  You should spell that out more clearly; there are a lot of different things json could refer to in this class.



> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-251) add schema for schemas

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835330#action_12835330 ] 

Doug Cutting commented on AVRO-251:
-----------------------------------

Might we instead add a metadata field like, "avro.schema.codec"?

> add schema for schemas
> ----------------------
>
>                 Key: AVRO-251
>                 URL: https://issues.apache.org/jira/browse/AVRO-251
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-251.patch, AVRO-251.patch
>
>
> A schema for schemas would permits schemas to be written in binary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.