Posted to user@thrift.apache.org by Jeff Hammerbacher <ha...@cloudera.com> on 2009/04/03 04:19:03 UTC

Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

See http://markmail.org/thread/7cgrwoc4er4mr3bp

Re: Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

Posted by Doug Cutting <cu...@apache.org>.
David Reiss wrote:
> For those of you who don't have git, forrest, *and* Java 5
> (not 6! 5!) installed, I built the docs and put them online:
> 
> http://www.projectornation.com/avro-doc/spec.html

Thanks!

> - No code generation.  The schema is all in JSON files that are parsed
>   at runtime.  For Python, this is probably fine.  I'm not really clear
>   on how it looks for Java (maybe someone can look at the Java tests and
>   explain it to the rest of us).  For C++, this will definitely make
>   the avro objects feel clunky because you'll have to access properties
>   by name.  And the lists won't be statically typed.

For C++ we'll probably implement code generation in Avro.  Java already 
includes code generation as an option.  Code generation isn't 
prohibited, it's just optional.  My guess is that it will only be 
implemented in Avro for C/C++ and Java.

Also, you need not access properties by name.  For example, the reader 
for generated Java code maintains an int->int mapping of remote fields 
to local fields, and fields are accessed by integer.  This is 
effectively what you must do in any generated code: you need a switch 
statement that maps a field id to the line of code which sets the field. 
In Thrift and Protocol Buffers, the remote field id is in the data, 
while in Avro it's instead in the schema.
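The mapping Doug describes can be sketched in a few lines of Python (a minimal illustration of the idea, not Avro's actual generated code; the field names and orderings here are invented): the reader aligns the writer's ("remote") field order with its own ("local") field order once, from the two schemas, and thereafter sets fields by integer index.

```python
# Minimal sketch of remote->local field dispatch (not Avro's real API).
# The writer's schema and the reader's schema list the same fields in
# different orders; the mapping is computed once, then used per record.

REMOTE_FIELDS = ["name", "age", "email"]   # field order in the writer's schema
LOCAL_FIELDS = ["email", "name", "age"]    # field order in the reader's schema

# Build the remote->local index mapping once, from the schemas alone.
local_index = {name: i for i, name in enumerate(LOCAL_FIELDS)}
remote_to_local = [local_index[name] for name in REMOTE_FIELDS]

def read_record(values):
    """values arrive in the writer's field order; place each by integer."""
    record = [None] * len(LOCAL_FIELDS)
    for remote_id, value in enumerate(values):
        record[remote_to_local[remote_id]] = value
    return record

print(read_record(["Ada", 36, "ada@example.com"]))
# ['ada@example.com', 'Ada', 36]
```

Because the mapping lives in the schemas, no per-field id needs to travel with the data, which is the difference from Thrift and Protocol Buffers that Doug points out above.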

> - The full schema is included with the messages, rather than having
>   field ids delimit the contents.  This is nice for big Hadoop files
>   since you only include the schema once.  (It was a technique that
>   we discussed for Thrift.)  For a system like (I guess) Hadoop that
>   has long-lived RPC connections with multiple messages passed, I guess
>   it is not that big of a deal either.  For a system like we have at
>   Facebook where the web server must connect to the feed/search/chat
>   server once for each RPC, it is a killer.

This can be optimized by instead passing the hash of the schema, and 
faulting if the other side has not previously seen that schema, sending 
it on demand.  I've not yet had time to completely specify and implement 
this approach, but I think it addresses your concern here.  The 
fundamental requirement is only that the server and client somehow have 
copies of each other's schemas, not that they exchange them with each 
message or connection.  This is why the handshake has a version number, 
to permit different mechanisms here.  The first one is the simplest.
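The hash-and-fault idea can be sketched as follows (a hypothetical illustration of the mechanism Doug outlines, not the eventual Avro handshake wire format; the Peer class and method names are invented): each side caches schemas by fingerprint, sends only the hash, and "faults" by requesting the full schema when the hash is unknown.

```python
# Hypothetical sketch of hash-based schema exchange (invented API,
# not Avro's actual handshake): send the fingerprint, fetch the full
# schema on demand only when the receiver has never seen it.

import hashlib
import json

def fingerprint(schema):
    """Hash a canonical JSON rendering of the schema."""
    canonical = json.dumps(schema, sort_keys=True).encode("utf-8")
    return hashlib.md5(canonical).hexdigest()

class Peer:
    def __init__(self):
        self.known = {}  # fingerprint -> schema

    def send(self, schema):
        # Only the small, fixed-size hash travels with each request.
        return {"hash": fingerprint(schema)}

    def receive(self, message, ask_for_schema):
        h = message["hash"]
        if h not in self.known:
            # Schema fault: fetch the full schema on demand, once.
            self.known[h] = ask_for_schema()
        return self.known[h]

schema = {"type": "record", "name": "Ping", "fields": []}
client, server = Peer(), Peer()
msg = client.send(schema)
resolved = server.receive(msg, ask_for_schema=lambda: schema)
assert resolved == schema
assert fingerprint(schema) in server.known  # later calls hit the cache
```

After the first fault, every subsequent message pays only the cost of the hash, which is what makes short-lived connections like Facebook's per-RPC case cheap again.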

Doug

Re: Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

Posted by David Reiss <dr...@facebook.com>.
For those of you who don't have git, forrest, *and* Java 5
(not 6! 5!) installed, I built the docs and put them online:

http://www.projectornation.com/avro-doc/spec.html

AFAICT, the main differences from Thrift are:

- No code generation.  The schema is all in JSON files that are parsed
  at runtime.  For Python, this is probably fine.  I'm not really clear
  on how it looks for Java (maybe someone can look at the Java tests and
  explain it to the rest of us).  For C++, this will definitely make
  the avro objects feel clunky because you'll have to access properties
  by name.  And the lists won't be statically typed.
- The full schema is included with the messages, rather than having
  field ids delimit the contents.  This is nice for big Hadoop files
  since you only include the schema once.  (It was a technique that
  we discussed for Thrift.)  For a system like (I guess) Hadoop that
  has long-lived RPC connections with multiple messages passed, I guess
  it is not that big of a deal either.  For a system like we have at
  Facebook where the web server must connect to the feed/search/chat
  server once for each RPC, it is a killer.
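The first point, schemas as runtime JSON rather than generated code, looks roughly like this in Python (a generic illustration rather than Avro's actual Python API; the record shape follows Avro's documented JSON schema form, but the surrounding code is invented):

```python
# Generic illustration of a schema as runtime data (not Avro's real
# Python API): the schema is JSON parsed at load time, so no generated
# classes are required -- properties are reached by name.

import json

SCHEMA_JSON = """
{"type": "record", "name": "User",
 "fields": [{"name": "id",   "type": "long"},
            {"name": "name", "type": "string"}]}
"""

schema = json.loads(SCHEMA_JSON)
field_names = [f["name"] for f in schema["fields"]]
print(schema["name"], field_names)
# User ['id', 'name']
```

This is exactly the by-name access David flags as clunky for C++, where one would want generated, statically typed accessors instead.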

--David



Bryan Duxbury wrote:
> Indeed, I am very curious about how this differs from Thrift.
> 
> On Apr 2, 2009, at 7:48 PM, Kevin Clark wrote:
> 
>> Reposting from thrift-user.
>>
>> On Thu, Apr 2, 2009 at 7:19 PM, Jeff Hammerbacher  
>> <ha...@cloudera.com> wrote:
>>> See http://markmail.org/thread/7cgrwoc4er4mr3bp
>>>
>> Is this a vote of no confidence on Doug's part? Last I heard, he was
>> still one of our mentors, and this project sounds an awful lot like
>> Thrift.
>>
>>
>>
>> -- 
>> Kevin Clark
>> http://glu.ttono.us
> 

Re: Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

Posted by Bryan Duxbury <br...@rapleaf.com>.
Indeed, I am very curious about how this differs from Thrift.

On Apr 2, 2009, at 7:48 PM, Kevin Clark wrote:

> Reposting from thrift-user.
>
> On Thu, Apr 2, 2009 at 7:19 PM, Jeff Hammerbacher  
> <ha...@cloudera.com> wrote:
>> See http://markmail.org/thread/7cgrwoc4er4mr3bp
>>
>
> Is this a vote of no confidence on Doug's part? Last I heard, he was
> still one of our mentors, and this project sounds an awful lot like
> Thrift.
>
>
>
> -- 
> Kevin Clark
> http://glu.ttono.us


Re: Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

Posted by Doug Cutting <cu...@apache.org>.
Kevin Clark wrote:
> Is this a vote of no confidence on Doug's part?

Avro's a stab at a different approach from Thrift.  I looked into 
modifying Thrift to have the features I wanted and it looked like 
there'd be little shared code.  One would need to somehow parse the IDL 
in each language.  One would need a new data format that takes advantage 
of the presence of the IDL and readers and writers of that format.  But 
that's pretty much all that Avro is.  So with no code intersection, it 
felt more natural as a separate project.  We'll see how it fares.

I don't know what a "vote of no confidence" means.  I have never been a 
user of Thrift.  I evaluated Thrift for use as a Y! internal data 
format, and ended up devising something different for that problem, 
Avro.  In this process I became and remain a mentor of Thrift at Apache, 
but, as always, I have lots of other responsibilities, and this 
volunteer activity often falls off my radar for weeks at a time.

Cheers,

Doug

Re: Avro, a cross-language serialization framework from Doug Cutting, proposed as Hadoop subproject

Posted by Kevin Clark <ke...@gmail.com>.
Reposting from thrift-user.

On Thu, Apr 2, 2009 at 7:19 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:
> See http://markmail.org/thread/7cgrwoc4er4mr3bp
>

Is this a vote of no confidence on Doug's part? Last I heard, he was
still one of our mentors, and this project sounds an awful lot like
Thrift.



-- 
Kevin Clark
http://glu.ttono.us