Posted to dev@thrift.apache.org by Bryan Duxbury <br...@rapleaf.com> on 2009/04/03 18:24:26 UTC

Re: [PROPOSAL] new subproject: Avro

It sounds like what you want is the option to avoid pre-generated
classes. If that's the only thing you need, it seems like we could  
bolt that on to Thrift with almost no work. I assume you'd have the  
schema stored in metadata or file header or something, right? (You  
wouldn't want to store the field names in the binary encoding as  
strings, since that would probably very quickly dwarf the size of the  
actual data in a lot of cases.)

If my assumptions are correct, it seems like it'd be a lot smarter to  
leverage existing Thrift infrastructure and encoding work rather than  
duplicating it for this lone feature.

-Bryan

On Apr 3, 2009, at 9:06 AM, Doug Cutting wrote:

> Owen O'Malley wrote:
>> 2. Protocol Buffers (and Thrift) encode the field names as id
>> numbers. That means that if you read them into a dynamic language
>> like Python, it has to use the field numbers instead of the
>> field names. In Avro, the field names are saved and there are no
>> field ids.
>
> This hints at a related problem with Thrift and Protocol Buffers,  
> which is that they require one to generate code for each datatype  
> one processes.  This is awkward in dynamic environments, where one  
> would like to write a script (Pig, Python, Perl, Hive, whatever) to  
> process input data and generate output data, without having to  
> locate the IDL for each input file, run an IDL compiler, load the  
> generated code, generate an IDL file for the output, run the  
> compiler again, load the output code and finally write your  
> output.  Avro, by contrast, lets you simply open your inputs, examine
> their datatypes, specify output types, and write them.
>
> Avro's Java implementation currently includes three different data  
> representations:
>
>  - a "generic" representation uses a standard set of datastructures  
> for all datatypes: records are represented as Map<String,Object>,  
> arrays as List<Object>, longs as Long, etc.
>
>  - a "reflect" representation uses Java reflection to permit one to  
> read and write existing Java classes with Avro.
>
>  - a "specific" representation generates Java classes that are  
> compiled and loaded, much like Thrift and Protocol Buffers.
>
> We don't expect most scripting languages to use more than a single  
> representation.  Implementing Avro is quite simple, by design.  We  
> have a Python implementation, and hope to add more soon.
>
> Doug
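
(For concreteness, a minimal sketch of the "generic" representation from
the quoted list above. The class and method names used here, such as
Schema.Parser, GenericData.Record and EncoderFactory, come from later Avro
Java releases and are assumed for illustration only; the implementation
described above represented records as Map<String,Object>.)

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.EncoderFactory;

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;

  public class GenericExample {
    public static void main(String[] args) throws IOException {
      // The schema is plain JSON; no IDL compiler or generated class is needed.
      Schema schema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"count\",\"type\":\"int\"}]}");

      // A "generic" record is just the schema plus field values.
      GenericRecord user = new GenericData.Record(schema);
      user.put("name", "ab");
      user.put("count", 3);

      // Serialize with a writer built from the same schema.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
      encoder.flush();
      System.out.println(out.size() + " bytes");  // 4 bytes for this record
    }
  }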


Re: [PROPOSAL] new subproject: Avro

Posted by Bryan Duxbury <br...@rapleaf.com>.
> Field ids are not present in Avro data except in the schema.  A  
> record's fields are serialized in the order that the fields occur  
> in the record's schema, with no per-field annotations whatsoever.
> For example, a record that contains a string and an int is  
> serialized simply as a string followed by an int, nothing before,  
> nothing between and nothing after. So, yes, it is a different data  
> format.

So you can't serialize nulls? It also seems like this would make  
forward/backward compatibility a little more complex. Thrift solves  
this problem by using tags to indicate what kind of field you're  
working with.

-Bryan

Re: [PROPOSAL] new subproject: Avro

Posted by George Porter <Ge...@Sun.COM>.
On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote:

> George Porter wrote:
>> While this representation would certainly be as compact as  
>> possible, wouldn't it prevent evolving the data structure over  
>> time?  One of the nice features of Google Protocol Buffers and  
>> Thrift is that you can evolve the set of fields over time, and  
>> older/newer clients can talk to older/newer services.  If the  
>> proposed Avro is evolvable, then perhaps I'm misunderstanding your  
>> statement about the lack of IDs in the serialized data.
>
> Avro supports schema evolution.  In Avro, the schema used to write  
> the data must be available when the data is read.  (In files, it is  
> typically stored in the file metadata.)
>
> If you have the schema that was used to write the data, and you're  
> expecting a slightly different schema, then you simply keep those  
> fields that are in both schemas and skip those that are not.  This is
> equivalent to Thrift's and Protocol Buffers' support for schema
> evolution, but does not require manually assigning numeric field ids.
>
> This feature can also be used to support projection.  If you have  
> records with many large fields, but only need a single field in a  
> particular computation, then you can specify an expected schema with  
> only that field, and the runtime will efficiently skip all of the  
> other fields, returning a record with just the single, expected field.

Thanks for the clarification--I better understand the schema
relationship.  The projection feature is nice, especially since it
seems like it would be able to support "sparse files" where you want
to just peek at large structs without incurring a lot of disk I/O
(for data serialized on disk).

>
>
>> I also agree with Bryan, in that it would be unfortunate to have  
>> two different Apache projects with overlapping goals.
>
> We already have both Thrift and Etch in the incubator, which have  
> similar goals.  Apache does not attempt to mandate that projects  
> have disjoint goals.  There are many ways to slice things, and  
> Apache prefers to rely on survival of the fittest rather than  
> forcing things together.
>
>> Regardless of features, both protocol buffers and thrift have the  
>> advantage of being debugged in mission-critical production  
>> environments.
>
> Yes, but, as I've argued in other messages in this thread, they do  
> not support the dynamic features we need.  Adding those features  
> would add new code that would share little with existing code in  
> those projects. So, while the projects are conceptually similar, the  
> implementations are necessarily different, and, without significant  
> code overlap, separate projects seem more natural.
>
> Doug

Makes sense.  Thanks,
George

Re: [PROPOSAL] new subproject: Avro

Posted by Doug Cutting <cu...@apache.org>.
George Porter wrote:
> While this representation would certainly be as compact as possible, 
> wouldn't it prevent evolving the data structure over time?  One of the 
> nice features of Google Protocol Buffers and Thrift is that you can 
> evolve the set of fields over time, and older/newer clients can talk to 
> older/newer services.  If the proposed Avro is evolvable, then perhaps 
> I'm misunderstanding your statement about the lack of IDs in the 
> serialized data.

Avro supports schema evolution.  In Avro, the schema used to write the 
data must be available when the data is read.  (In files, it is 
typically stored in the file metadata.)

If you have the schema that was used to write the data, and you're 
expecting a slightly different schema, then you simply keep those fields
that are in both schemas and skip those that are not.  This is equivalent
to Thrift's and Protocol Buffers' support for schema evolution, but does
not require manually assigning numeric field ids.

This feature can also be used to support projection.  If you have 
records with many large fields, but only need a single field in a 
particular computation, then you can specify an expected schema with 
only that field, and the runtime will efficiently skip all of the other 
fields, returning a record with just the single, expected field.
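
For concreteness, a minimal sketch of such a projection with the generic
API (GenericDatumReader and DecoderFactory names are from later Avro Java
releases, and the writer's record is assumed to also be named "User", so
treat this as illustrative only):

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.Decoder;
  import org.apache.avro.io.DecoderFactory;

  import java.io.IOException;

  public class ProjectionExample {
    // 'bytes' were written with writerSchema, e.g. {name: string, count: int};
    // the reader schema keeps only the "name" field, so "count" is skipped.
    static GenericRecord readNameOnly(byte[] bytes, Schema writerSchema)
        throws IOException {
      Schema readerSchema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");
      GenericDatumReader<GenericRecord> reader =
          new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
      Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
      return reader.read(null, decoder);  // fields absent from readerSchema are skipped
    }
  }

The same writer/reader schema pair is all that schema evolution needs;
projection is just the case where the expected schema drops fields.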

> I also agree with Bryan, in that it would be unfortunate to have two 
> different Apache projects with overlapping goals.

We already have both Thrift and Etch in the incubator, which have 
similar goals.  Apache does not attempt to mandate that projects have 
disjoint goals.  There are many ways to slice things, and Apache prefers 
to rely on survival of the fittest rather than forcing things together.

> Regardless of 
> features, both protocol buffers and thrift have the advantage of being 
> debugged in mission-critical production environments.

Yes, but, as I've argued in other messages in this thread, they do not 
support the dynamic features we need.  Adding those features would add 
new code that would share little with existing code in those projects. 
So, while the projects are conceptually similar, the implementations are 
necessarily different, and, without significant code overlap, separate 
projects seem more natural.

Doug

Re: [PROPOSAL] new subproject: Avro

Posted by Scott Carey <sc...@richrelevance.com>.
On 4/3/09 12:03 PM, "George Porter" <Ge...@Sun.COM> wrote:

> 
> 
> On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote:
>>> 
>> 
>> Field ids are not present in Avro data except in the schema.  A
>> record's fields are serialized in the order that the fields occur in
>> the record's schema, with no per-field annotations whatsoever.  For
>> example, a record that contains a string and an int is serialized
>> simply as a string followed by an int, nothing before, nothing
>> between and nothing after. So, yes, it is a different data format.
> 
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time?  One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk
> to older/newer services.  If the proposed Avro is evolvable, then
> perhaps I'm misunderstanding your statement about the lack of IDs in
> the serialized data.

From a quick perusal of the serialization format -- it contains headers with
type/schema information, and other metadata blocks.  The types can be
inferred from this, and if this is done right then older/newer clients will
be able to read things just fine.  What can't be done is mixing two
different formats in the same stream if headers define the format of the
whole stream.

I have not looked much deeper than that, but it looks like schema evolution
is feasible.

> 
> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals.  Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.
> 
> -George
> 


Re: [PROPOSAL] new subproject: Avro

Posted by George Porter <Ge...@Sun.COM>.
On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote:
>>
>
> Field ids are not present in Avro data except in the schema.  A  
> record's fields are serialized in the order that the fields occur in  
> the record's schema, with no per-field annotations whatsoever.  For
> example, a record that contains a string and an int is serialized  
> simply as a string followed by an int, nothing before, nothing  
> between and nothing after. So, yes, it is a different data format.

While this representation would certainly be as compact as possible,  
wouldn't it prevent evolving the data structure over time?  One of the  
nice features of Google Protocol Buffers and Thrift is that you can  
evolve the set of fields over time, and older/newer clients can talk  
to older/newer services.  If the proposed Avro is evolvable, then  
perhaps I'm misunderstanding your statement about the lack of IDs in  
the serialized data.

I also agree with Bryan, in that it would be unfortunate to have two  
different Apache projects with overlapping goals.  Regardless of  
features, both protocol buffers and thrift have the advantage of being  
debugged in mission-critical production environments.

-George

Re: [PROPOSAL] new subproject: Avro

Posted by Doug Cutting <cu...@apache.org>.
Bryan Duxbury wrote:
> It's not actually a different data format, is it? You're saying that the 
> user wouldn't specify the field IDs, but you'd fundamentally still use 
> field ids for compactness and the like.

Field ids are not present in Avro data except in the schema.  A record's 
fields are serialized in the order that the fields occur in the record's
schema, with no per-field annotations whatsoever.  For example, a record 
that contains a string and an int is serialized simply as a string 
followed by an int, nothing before, nothing between and nothing after. 
So, yes, it is a different data format.
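
For illustration (worked out here from Avro's published binary encoding,
not taken from the original mail): a string is written as a zigzag-varint
length followed by its UTF-8 bytes, and an int as a zigzag varint, so a
record {name: "ab", count: 3} occupies exactly four bytes:

  0x04        string length 2, zigzag varint
  0x61 0x62   UTF-8 bytes of "ab"
  0x06        int 3, zigzag varint

No field ids, type tags, or terminators appear anywhere in the stream.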

> The bottom line is that I would love to see greater cooperation between 
> Hadoop and Thrift. Unless it's impossible or impractical for Thrift to 
> be useful here, I think we'd be willing to work towards Hadoop's needs.

Perhaps Thrift could be augmented to support Avro's JSON schemas and 
serialization.  Then it could interoperate with other Avro-based 
systems.  But then Thrift would have yet another serialization format
that every language would need to implement for it to be useful...

Avro will only ever have one serialization format.  Thrift fundamentally 
standardizes an API, not a data format.  Avro is fundamentally a data
format specification, like XML.  Thrift could implement this
specification.  The Avro project includes reference implementations, but 
the format is intended to be simple enough and the specification stable 
enough that others might reasonably develop alternate, independent 
implementations.

Doug

Re: [PROPOSAL] new subproject: Avro

Posted by Bryan Duxbury <br...@rapleaf.com>.
> With the schema in hand, you don't need to tag data with field  
> numbers or types, since that's all there in the schema.  So, having  
> the schema, you can use a simpler data format.

To a degree, we already have that in Thrift - we call it the  
DenseProtocol.

> Would you write parsers for Thrift's IDL in every language?  Or  
> would you use JSON, as Avro does, to avoid that?

When it comes to having a code-usable IDL for the schema, I'm totally  
pro-JSON.

> Once you're using a different IDL and a different data format,  
> what's shared with Thrift?  Fundamentally, those two things define  
> a serialization system, no?

It's not actually a different data format, is it? You're saying that  
the user wouldn't specify the field IDs, but you'd fundamentally  
still use field ids for compactness and the like. You may not use
actual Thrift-generated objects, but you could certainly use the Binary
or Compact protocol from Thrift to do all the writing to the wire.

You might also be able to use (or contribute to) Thrift's RPC-level  
stuff like server implementations. We have some respectable Java  
servers written, and if those aren't enough for your uses, I'd  
actually be really interested in seeing if we could generalize some  
of the Hadoop stuff to be useful within Thrift.

The bottom line is that I would love to see greater cooperation  
between Hadoop and Thrift. Unless it's impossible or impractical for  
Thrift to be useful here, I think we'd be willing to work towards  
Hadoop's needs.

-Bryan


Re: [PROPOSAL] new subproject: Avro

Posted by Doug Cutting <cu...@apache.org>.
Bryan Duxbury wrote:
> It sounds like what you want is the option to avoid pre-generated classes.

That's part of it.  But, once you have the schema, you might as well 
take advantage of it.

With the schema in hand, you don't need to tag data with field numbers 
or types, since that's all there in the schema.  So, having the schema, 
you can use a simpler data format.

Also, with the schema, resolving version differences is simplified.
Developers don't need to assign field numbers, but can just use names.
For performance, an implementation can internally use field positions
while reading, to avoid string comparisons, but developers no longer need
to specify these; they can use names, as in most software.  Here having
the schema means we can simplify the IDL and its versioning semantics.
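
A hedged sketch of that internal use of positions (illustrative only, not
Avro's actual resolver; Schema.getField and Field.pos are names from later
Avro Java releases): resolve the reader's field names against the writer's
schema once per schema pair, then read each record with the cached integer
positions.

  import org.apache.avro.Schema;

  import java.util.List;

  class FieldResolution {
    // Map each reader field to the writer's position for that name, once;
    // per-record reads then use integer positions, not string comparisons.
    static int[] resolveByName(Schema writerSchema, Schema readerSchema) {
      List<Schema.Field> readerFields = readerSchema.getFields();
      int[] writerPos = new int[readerFields.size()];
      for (int i = 0; i < readerFields.size(); i++) {
        Schema.Field w = writerSchema.getField(readerFields.get(i).name());
        writerPos[i] = (w == null) ? -1 : w.pos();  // -1: not written; fill from default
      }
      return writerPos;
    }
  }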

> If that's the only thing you need, it seems like we could bolt that on 
> to Thrift with almost no work.

Would you write parsers for Thrift's IDL in every language?  Or would 
you use JSON, as Avro does, to avoid that?

Once you're using a different IDL and a different data format, what's 
shared with Thrift?  Fundamentally, those two things define a 
serialization system, no?

> I assume you'd have the schema stored in 
> metadata or file header or something, right? (You wouldn't want to store 
> the field names in the binary encoding as strings, since that would 
> probably very quickly dwarf the size of the actual data in a lot of cases.)

Yes, in data files the schema is typically stored in the metadata.
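
For instance, a minimal sketch using the object container file classes
from later Avro Java releases (DataFileWriter and DataFileReader are
assumed here; the class names at the time of this thread may have
differed):

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  import java.io.File;
  import java.io.IOException;

  class SchemaInFileHeader {
    static void writeFile(Schema schema, GenericRecord record, File file)
        throws IOException {
      DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
          new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, file);  // the schema is written once, into the file header
      writer.append(record);
      writer.close();
    }

    static Schema readSchema(File file) throws IOException {
      // No IDL, field ids, or generated code needed: the reader recovers
      // the writer's schema from the file's metadata.
      DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
          file, new GenericDatumReader<GenericRecord>());
      try {
        return reader.getSchema();
      } finally {
        reader.close();
      }
    }
  }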

> If my assumptions are correct, it seems like it'd be a lot smarter to 
> leverage existing Thrift infrastructure and encoding work rather than 
> duplicating it for this lone feature.

What specific shared infrastructure would be leveraged?  For Hadoop's 
RPC, I hope to adapt Hadoop's client and server implementations as a 
transport, as these have been highly tuned for Hadoop's performance 
requirements.

Doug

Re: [PROPOSAL] new subproject: Avro

Posted by Doug Cutting <cu...@apache.org>.
I have responded to this on general@hadoop.apache.org.  Let's not
cross-post, but rather keep this discussion on a single list.  I think 
general@ is the best list for this discussion.

http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/browser

Thanks,

Doug