Posted to dev@thrift.apache.org by Bryan Duxbury <br...@rapleaf.com> on 2009/04/03 18:24:26 UTC
Re: [PROPOSAL] new subproject: Avro
It sounds like what you want is the option to avoid pre-generated
classes. If that's the only thing you need, it seems like we could
bolt that on to Thrift with almost no work. I assume you'd have the
schema stored in metadata or file header or something, right? (You
wouldn't want to store the field names in the binary encoding as
strings, since that would probably very quickly dwarf the size of the
actual data in a lot of cases.)
If my assumptions are correct, it seems like it'd be a lot smarter to
leverage existing Thrift infrastructure and encoding work rather than
duplicating it for this lone feature.
-Bryan
On Apr 3, 2009, at 9:06 AM, Doug Cutting wrote:
> Owen O'Malley wrote:
>> 2. Protocol buffers (and thrift) encode the field names as id
>> numbers. That means that if you read them into a dynamic language
>> like Python, it has to use the field numbers instead of the
>> field names. In Avro, the field names are saved and there are no
>> field ids.
>
> This hints at a related problem with Thrift and Protocol Buffers,
> which is that they require one to generate code for each datatype
> one processes. This is awkward in dynamic environments, where one
> would like to write a script (Pig, Python, Perl, Hive, whatever) to
> process input data and generate output data, without having to
> locate the IDL for each input file, run an IDL compiler, load the
> generated code, generate an IDL file for the output, run the
> compiler again, load the output code and finally write your
> output. Avro rather lets you simply open your inputs, examine
> their datatypes, specify output types and write them.
>
> Avro's Java implementation currently includes three different data
> representations:
>
> - a "generic" representation uses a standard set of datastructures
> for all datatypes: records are represented as Map<String,Object>,
> arrays as List<Object>, longs as Long, etc.
>
> - a "reflect" representation uses Java reflection to permit one to
> read and write existing Java classes with Avro.
>
> - a "specific" representation generates Java classes that are
> compiled and loaded, much like Thrift and Protocol Buffers.
>
> We don't expect most scripting languages to use more than a single
> representation. Implementing Avro is quite simple, by design. We
> have a Python implementation, and hope to add more soon.
>
> Doug
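The "generic" representation described above maps naturally onto a dynamic
language's built-in types. A minimal Python sketch (illustrative only, not
Avro's actual API):

```python
# "Generic" representation sketch: one standard set of datastructures for
# all datatypes, no generated classes. Records become dicts (Java:
# Map<String, Object>), arrays become lists, longs become ints, etc.

record = {
    "name": "foo",       # string field
    "count": 42,         # long field
    "tags": ["a", "b"],  # array field -> plain list
}

# A script can examine a record's datatypes without any pre-generated code:
for field, value in record.items():
    print(field, type(value).__name__)
```

This is the property that makes scripting-language use easy: there is
nothing to compile or load before reading data.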
Re: [PROPOSAL] new subproject: Avro
Posted by Bryan Duxbury <br...@rapleaf.com>.
> Field ids are not present in Avro data except in the schema. A
> record's fields are serialized in the order that the fields occur
> in the record's schema, with no per-field annotations whatsoever.
> For example, a record that contains a string and an int is
> serialized simply as a string followed by an int, nothing before,
> nothing between and nothing after. So, yes, it is a different data
> format.
So you can't serialize nulls? It also seems like this would make
forward/backward compatibility a little more complex. Thrift solves
this problem by using tags to indicate what kind of field you're
working with.
-Bryan
Re: [PROPOSAL] new subproject: Avro
Posted by George Porter <Ge...@Sun.COM>.
On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote:
> George Porter wrote:
>> While this representation would certainly be as compact as
>> possible, wouldn't it prevent evolving the data structure over
>> time? One of the nice features of Google Protocol Buffers and
>> Thrift is that you can evolve the set of fields over time, and
>> older/newer clients can talk to older/newer services. If the
>> proposed Avro is evolvable, then perhaps I'm misunderstanding your
>> statement about the lack of IDs in the serialized data.
>
> Avro supports schema evolution. In Avro, the schema used to write
> the data must be available when the data is read. (In files, it is
> typically stored in the file metadata.)
>
> If you have the schema that was used to write the data, and you're
> expecting a slightly different schema, then you simply keep those
> fields that are in both schemas and skip those that are not. This is
> equivalent to Thrift's and Protocol Buffers' support for schema
> evolution, but does not require manually assigning numeric field ids.
>
> This feature can also be used to support projection. If you have
> records with many large fields, but only need a single field in a
> particular computation, then you can specify an expected schema with
> only that field, and the runtime will efficiently skip all of the
> other fields, returning a record with just the single, expected field.
Thanks for the clarification--I better understand the schema
relationship now. The projection capability is especially nice, since
it seems like it could support "sparse files" where you just peek at
large structs without incurring a lot of disk I/O (for data
serialized on-disk).
>
>
>> I also agree with Bryan, in that it would be unfortunate to have
>> two different Apache projects with overlapping goals.
>
> We already have both Thrift and Etch in the incubator, which have
> similar goals. Apache does not attempt to mandate that projects
> have disjoint goals. There are many ways to slice things, and
> Apache prefers to rely on survival of the fittest rather than
> forcing things together.
>
>> Regardless of features, both protocol buffers and thrift have the
>> advantage of being debugged in mission-critical production
>> environments.
>
> Yes, but, as I've argued in other messages in this thread, they do
> not support the dynamic features we need. Adding those features
> would add new code that would share little with existing code in
> those projects. So, while the projects are conceptually similar, the
> implementations are necessarily different, and, without significant
> code overlap, separate projects seem more natural.
>
> Doug
Makes sense. Thanks,
George
Re: [PROPOSAL] new subproject: Avro
Posted by Doug Cutting <cu...@apache.org>.
George Porter wrote:
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time? One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk to
> older/newer services. If the proposed Avro is evolvable, then perhaps
> I'm misunderstanding your statement about the lack of IDs in the
> serialized data.
Avro supports schema evolution. In Avro, the schema used to write the
data must be available when the data is read. (In files, it is
typically stored in the file metadata.)
If you have the schema that was used to write the data, and you're
expecting a slightly different schema, then you simply keep those fields
that are in both schemas and skip those that are not. This is
equivalent to Thrift's and Protocol Buffers' support for schema
evolution, but does not require manually assigning numeric field ids.
This feature can also be used to support projection. If you have
records with many large fields, but only need a single field in a
particular computation, then you can specify an expected schema with
only that field, and the runtime will efficiently skip all of the other
fields, returning a record with just the single, expected field.
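The resolution rule above (keep fields present in both schemas, skip the
rest, which also yields projection) can be sketched as follows. This is a
hand-rolled illustration with made-up encodings, not Avro's real wire
format or API:

```python
import io
import struct

# Fields are written in the writer's schema order with no per-field tags.
# A reader keeps fields present in both schemas and skips the rest -- the
# same mechanism gives projection for free.

def write_record(schema, record):
    out = io.BytesIO()
    for name, typ in schema:
        if typ == "int":
            out.write(struct.pack(">i", record[name]))
        else:  # "string"
            data = record[name].encode("utf-8")
            out.write(struct.pack(">i", len(data)) + data)
    return out.getvalue()

def read_record(writer_schema, reader_schema, buf):
    wanted = {name for name, _ in reader_schema}
    inp = io.BytesIO(buf)
    result = {}
    for name, typ in writer_schema:        # decode in the writer's order
        if typ == "int":
            value = struct.unpack(">i", inp.read(4))[0]
        else:  # "string"
            (n,) = struct.unpack(">i", inp.read(4))
            value = inp.read(n).decode("utf-8")
        if name in wanted:                 # projection: keep only what the
            result[name] = value           # reader's schema asks for
    return result

writer = [("name", "string"), ("size", "int"), ("blob", "string")]
reader = [("size", "int")]                 # expected schema: a single field
buf = write_record(writer, {"name": "x", "size": 7, "blob": "big payload"})
print(read_record(writer, reader, buf))    # only "size" survives
```

(A real implementation would skip unwanted fields without materializing
them; this sketch decodes everything for brevity.)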
> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals.
We already have both Thrift and Etch in the incubator, which have
similar goals. Apache does not attempt to mandate that projects have
disjoint goals. There are many ways to slice things, and Apache prefers
to rely on survival of the fittest rather than forcing things together.
> Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.
Yes, but, as I've argued in other messages in this thread, they do not
support the dynamic features we need. Adding those features would add
new code that would share little with existing code in those projects.
So, while the projects are conceptually similar, the implementations are
necessarily different, and, without significant code overlap, separate
projects seem more natural.
Doug
Re: [PROPOSAL] new subproject: Avro
Posted by Scott Carey <sc...@richrelevance.com>.
On 4/3/09 12:03 PM, "George Porter" <Ge...@Sun.COM> wrote:
>
>
> On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote:
>>>
>>
>> Field ids are not present in Avro data except in the schema. A
>> record's fields are serialized in the order that the fields occur in
>> the record's schema, with no per-field annotations whatsoever. For
>> example, a record that contains a string and an int is serialized
>> simply as a string followed by an int, nothing before, nothing
>> between and nothing after. So, yes, it is a different data format.
>
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time? One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk
> to older/newer services. If the proposed Avro is evolvable, then
> perhaps I'm misunderstanding your statement about the lack of IDs in
> the serialized data.
From a quick perusal of the serialization format -- it contains headers
with type/schema information and other metadata blocks. The types can be
inferred from these, and if this is done right then older/newer clients
will be able to read things just fine. What can't be done is mixing two
different formats in the same stream if headers define the format of the
whole stream.
I have not looked much deeper than that, but it looks like schema evolution
is feasible.
>
> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals. Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.
>
> -George
>
Re: [PROPOSAL] new subproject: Avro
Posted by George Porter <Ge...@Sun.COM>.
On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote:
>>
>
> Field ids are not present in Avro data except in the schema. A
> record's fields are serialized in the order that the fields occur in
> the record's schema, with no per-field annotations whatsoever. For
> example, a record that contains a string and an int is serialized
> simply as a string followed by an int, nothing before, nothing
> between and nothing after. So, yes, it is a different data format.
While this representation would certainly be as compact as possible,
wouldn't it prevent evolving the data structure over time? One of the
nice features of Google Protocol Buffers and Thrift is that you can
evolve the set of fields over time, and older/newer clients can talk
to older/newer services. If the proposed Avro is evolvable, then
perhaps I'm misunderstanding your statement about the lack of IDs in
the serialized data.
I also agree with Bryan, in that it would be unfortunate to have two
different Apache projects with overlapping goals. Regardless of
features, both protocol buffers and thrift have the advantage of being
debugged in mission-critical production environments.
-George
Re: [PROPOSAL] new subproject: Avro
Posted by Doug Cutting <cu...@apache.org>.
Bryan Duxbury wrote:
> It's not actually a different data format, is it? You're saying that the
> user wouldn't specify the field IDs, but you'd fundamentally still use
> field ids for compactness and the like.
Field ids are not present in Avro data except in the schema. A record's
fields are serialized in the order that the fields occur in the record's
schema, with no per-field annotations whatsoever. For example, a record
that contains a string and an int is serialized simply as a string
followed by an int, nothing before, nothing between and nothing after.
So, yes, it is a different data format.
> The bottom line is that I would love to see greater cooperation between
> Hadoop and Thrift. Unless it's impossible or impractical for Thrift to
> be useful here, I think we'd be willing to work towards Hadoop's needs.
Perhaps Thrift could be augmented to support Avro's JSON schemas and
serialization. Then it could interoperate with other Avro-based
systems. But then Thrift would have yet another serialization format
that every language would need to implement for it to be useful...
Avro will only ever have one serialization format. Thrift fundamentally
standardizes an API, not a data format. Avro fundamentally is a data
format specification, like XML. Thrift could implement this
specification. The Avro project includes reference implementations, but
the format is intended to be simple enough and the specification stable
enough that others might reasonably develop alternate, independent
implementations.
Doug
Re: [PROPOSAL] new subproject: Avro
Posted by Bryan Duxbury <br...@rapleaf.com>.
> With the schema in hand, you don't need to tag data with field
> numbers or types, since that's all there in the schema. So, having
> the schema, you can use a simpler data format.
To a degree, we already have that in Thrift - we call it the
DenseProtocol.
> Would you write parsers for Thrift's IDL in every language? Or
> would you use JSON, as Avro does, to avoid that?
When it comes to having a code-usable IDL for the schema, I'm totally
pro-JSON.
> Once you're using a different IDL and a different data format,
> what's shared with Thrift? Fundamentally, those two things define
> a serialization system, no?
It's not actually a different data format, is it? You're saying that
the user wouldn't specify the field IDs, but you'd fundamentally
still use field ids for compactness and the like. You may not use
actual Thrift generated objects, but you could certainly use Binary
or Compact protocol from Thrift to do all the writing to the wire.
You might also be able to use (or contribute to) Thrift's RPC-level
stuff like server implementations. We have some respectable Java
servers written, and if those aren't enough for your uses, I'd
actually be really interested in seeing if we could generalize some
of the Hadoop stuff to be useful within Thrift.
The bottom line is that I would love to see greater cooperation
between Hadoop and Thrift. Unless it's impossible or impractical for
Thrift to be useful here, I think we'd be willing to work towards
Hadoop's needs.
-Bryan
Re: [PROPOSAL] new subproject: Avro
Posted by Doug Cutting <cu...@apache.org>.
Bryan Duxbury wrote:
> It sounds like what you want is the option to avoid pre-generated classes.
That's part of it. But, once you have the schema, you might as well
take advantage of it.
With the schema in hand, you don't need to tag data with field numbers
or types, since that's all there in the schema. So, having the schema,
you can use a simpler data format.
Also, with the schema, resolving version differences is simplified.
Developers don't need to assign field numbers, but can just use names.
For performance, one can internally use field numbers while reading, to
avoid string comparisons, but developers no longer have to specify them
and can use names, as in most software. Here having the schema means we
can simplify the IDL and its versioning semantics.
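The names-outside, numbers-inside idea can be sketched like this
(illustrative, not Avro's actual implementation): resolve the reader's
field names to writer positions once per schema pair, then read each
record by index.

```python
# Developers use field names; the runtime resolves each name to a writer
# position once, then reads by index -- no per-record string comparisons.
# (Illustrative sketch, not Avro's actual implementation.)

writer_fields = ["name", "size", "blob"]   # order the fields were written in
reader_fields = ["size", "name"]           # fields this reader asks for

# Resolved once per (writer, reader) schema pair:
positions = [writer_fields.index(f) if f in writer_fields else None
             for f in reader_fields]

def project(row):
    """Pick the reader's fields out of one positionally-decoded row."""
    return [row[p] if p is not None else None for p in positions]

rows = [["x", 7, "payload"], ["y", 8, "more"]]
print([project(r) for r in rows])
```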
> If that's the only thing you need, it seems like we could bolt that on
> to Thrift with almost no work.
Would you write parsers for Thrift's IDL in every language? Or would
you use JSON, as Avro does, to avoid that?
Once you're using a different IDL and a different data format, what's
shared with Thrift? Fundamentally, those two things define a
serialization system, no?
> I assume you'd have the schema stored in
> metadata or file header or something, right? (You wouldn't want to store
> the field names in the binary encoding as strings, since that would
> probably very quickly dwarf the size of the actual data in a lot of cases.)
Yes, in data files the schema is typically stored in the metadata.
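A minimal sketch of that arrangement (an assumed layout for illustration,
not Avro's actual container file format): write the JSON schema into a
header, then the data, so any reader can recover the writer's schema
before decoding.

```python
import io
import json

# Assumed file layout for illustration: 4-byte header length, the writer's
# schema as JSON, then the serialized records. Not Avro's real container
# format.

schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "email", "type": "string"}],
}

def write_file(schema, payload):
    header = json.dumps(schema).encode("utf-8")
    buf = io.BytesIO()
    buf.write(len(header).to_bytes(4, "big"))  # header length
    buf.write(header)                          # the schema itself
    buf.write(payload)                         # then the records
    return buf.getvalue()

def read_schema(blob):
    n = int.from_bytes(blob[:4], "big")
    return json.loads(blob[4:4 + n].decode("utf-8"))

data = write_file(schema, b"...records...")
print(read_schema(data)["name"])  # the reader recovers the writer's schema
```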
> If my assumptions are correct, it seems like it'd be a lot smarter to
> leverage existing Thrift infrastructure and encoding work rather than
> duplicating it for this lone feature.
What specific shared infrastructure would be leveraged? For Hadoop's
RPC, I hope to adapt Hadoop's client and server implementations as a
transport, as these have been highly tuned for Hadoop's performance
requirements.
Doug
Re: [PROPOSAL] new subproject: Avro
Posted by Doug Cutting <cu...@apache.org>.
I have responded to this on general@hadoop.apache.org. Let's not
cross-post, but rather keep this discussion on a single list. I think
general@ is the best list for this discussion.
http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/browser
Thanks,
Doug