Posted to common-dev@hadoop.apache.org by Bryan Duxbury <br...@rapleaf.com> on 2008/10/24 17:40:35 UTC

Multi-language serialization discussion

I've been reading the discussion about what serialization/RPC project  
to use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I  
thought I'd throw in a pro-Thrift vote.

It's already got libraries in many languages. It has pluggable  
protocols and transports (including asynchronous ones, in some  
languages), and even includes server implementations. It's  
automatically versioned and transparently backwards compatible with  
other versions. It will shortly be just as compact and fast as  
Protocol Buffers, and it's already in Apache.

As someone who has been working with and contributing to Thrift for a  
few months now, I have to say it seems like it fits the bill pretty  
nicely. The only possible issue is maturity, but hey, as an Apache  
project yourself, you know what the solution to that is: patches!

If there are further detailed questions about features of Thrift or  
problems it might have, I'd love to make myself available to help  
answer them.

Thanks
Bryan

Re: Multi-language serialization discussion

Posted by Doug Cutting <cu...@apache.org>.
Chad Walters wrote:
> Re-open that discussion and I imagine you might get some interested parties.

I think I just did, no?

> Bumping up a level, rather than inventing a whole new set of Hadoop-specific RPC and serialization mechanisms

Whatever we use, we'd probably end up recycling much of Hadoop's 
client/server implementation, since it's been finely tuned for Hadoop's 
performance needs, and I've not yet seen a Thrift transport that looks 
appropriate.  We also need to add authentication and authorization 
layers to Hadoop's RPC, which don't exist in Thrift either, as far as I 
can tell.  So mostly what we'd use from Thrift directly is object 
serialization.

That said, if we use Thrift for object serialization then we'd probably 
eventually contribute our transport, authentication and authorization 
stuff to the Thrift project.  We'd probably want to build it first in 
Hadoop, since it's critical kernel stuff for Hadoop, but, once it's 
stable, contribute it to Thrift if it seemed useful to others.

As a serialization layer, Thrift lacks the self-describing stuff that I 
think is critical.  If JSON will be the primary text format, then it 
looks to me that it would be easier and more natural to base a binary 
self-describing format on JSON schema than on Thrift IDL, but perhaps I 
can be convinced otherwise.

Doug

RE: Multi-language serialization discussion

Posted by Chad Walters <Ch...@microsoft.com>.
Doug,

There has previously been a bunch of discussion on the Thrift list (possibly pre-Incubator) about self-describing Thrift streams and the like when we talked about providing a superset of RecordIO functionality. Re-open that discussion and I imagine you might get some interested parties. Writing an interpreter of Thrift type descriptors for any of the scripting languages doesn't seem like it would be that hard.
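
To make that concrete, here is a rough sketch in Java (the descriptor types below are invented for illustration; this is not Thrift's actual type-descriptor API) of what interpreting a descriptor at runtime could look like: decode fields by walking the descriptor rather than using generated per-record classes.

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: decode a flat record by interpreting a runtime
    // type descriptor instead of relying on generated per-record classes.
    public class DescriptorInterpreter {
      enum Kind { I32, STRING }

      static class Field {
        final String name; final Kind kind;
        Field(String name, Kind kind) { this.name = name; this.kind = kind; }
      }

      static Map<String, Object> decode(List<Field> descriptor, DataInputStream in)
          throws IOException {
        Map<String, Object> record = new LinkedHashMap<String, Object>();
        for (Field f : descriptor) {
          if (f.kind == Kind.I32) {
            record.put(f.name, in.readInt());
          } else {                                // Kind.STRING
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            record.put(f.name, new String(bytes, "UTF-8"));
          }
        }
        return record;
      }
    }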

Bumping up a level, rather than inventing a whole new set of Hadoop-specific RPC and serialization mechanisms, I'd suggest that there would be more leverage from adopting Thrift. Thrift is in the Apache Incubator (as you know ;)) and there is already a fairly significant overlap in the two communities. A number of Hadoop-related technologies are already using Thrift in places (HBase, Hive, etc.). If there were more involvement in Thrift from core Hadoop development, I am pretty certain you would get what you wanted out of it pretty quickly.

Chad

-----Original Message-----
From: Doug Cutting [mailto:cutting@gmail.com] On Behalf Of Doug Cutting
Sent: Friday, October 24, 2008 2:40 PM
To: core-dev@hadoop.apache.org
Subject: Re: Multi-language serialization discussion

Bryan Duxbury wrote:
> I've been reading the discussion about what serialization/RPC project to
> use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I
> thought I'd throw in a pro-Thrift vote.

I've been thinking about this, and here's where I've come to:

It's not just RPC.  We need a single, primary object serialization
system that's used for RPC and for most file-based application data.

Scripting languages are primary users of Hadoop.  We must thus make it
easy and natural for scripting languages to process data with Hadoop.

Data should be self-describing.  For example, a script should be able to
read a file without having to first generate code specific to the
records in that file.  Similarly, a script should be able to write
records without having to externally define their schema.

We need an efficient binary file format.  A file of records should not
repeat the record names with each record.  Rather, the record schema
used should be stored in the file once.  Programs should be able to read
the schema and efficiently produce instances from the file.

The schema language should support specification of required and
optional fields, so that class definitions may evolve.

For some languages (e.g., Java & C) one may wish to generate native
classes to represent a schema, and to read & write instances.

So, how well does Thrift meet these needs?  Thrift's IDL is a schema
language, and JSON is a self-describing data format.  But arbitrary JSON
data is not generally readable by any Thrift-based program.  And
Thrift's binary formats are not self-describing: they do not include the
IDL.  Nor does the Thrift runtime in each language permit one to read an
IDL specification and then use it to efficiently read and write compact,
self-describing data.

I wonder if we might instead use JSON schemas to describe data.

http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft

We'd implement, in each language, a codec that, given a schema, can
efficiently read and write instances of that schema.  (JSON schemas are
JSON data, so any language that supports JSON can already read and write
a JSON schema.)  The writer could either take a provided schema, or
automatically induce a schema from the records written.  Schemas would
be stored in data files, with the data.
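
To show the shape of this, here is a minimal sketch in Java (the file layout and class names are invented for illustration, not a proposed format): the schema is a plain JSON string stored once at the front of the file, and the records that follow are length-prefixed and carry no field names.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Sketch of a self-describing container: the JSON schema is written once,
    // in the header; each record is a length-prefixed binary blob.
    public class SelfDescribingFile {
      public static void main(String[] args) throws IOException {
        String schema = "{\"type\":\"object\",\"properties\":"
            + "{\"id\":{\"type\":\"integer\"},\"name\":{\"type\":\"string\"}}}";

        DataOutputStream out =
            new DataOutputStream(new FileOutputStream("records.bin"));
        byte[] schemaBytes = schema.getBytes("UTF-8");
        out.writeInt(schemaBytes.length);          // schema stored once, up front
        out.write(schemaBytes);
        writeRecord(out, new byte[] { 1, 2, 3 });  // record bodies left opaque here
        writeRecord(out, new byte[] { 4, 5, 6 });
        out.close();

        DataInputStream in =
            new DataInputStream(new FileInputStream("records.bin"));
        byte[] header = new byte[in.readInt()];
        in.readFully(header);
        // Any language with a JSON parser can read the schema; a codec would
        // then use it to decode the records that follow.
        System.out.println(new String(header, "UTF-8"));
        in.close();
      }

      static void writeRecord(DataOutputStream out, byte[] body) throws IOException {
        out.writeInt(body.length);   // field names are never repeated per record
        out.write(body);
      }
    }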

JSON's not perfect.  It doesn't (yet) support binary data: that would
need to be fixed.  But I think Thrift's focus on code-generation makes
it less friendly to scripting languages, which are primary users of
Hadoop.  Code generation is possible given a schema, and may be useful
as an optimization in many cases, but it should be optional, not central.

Folks should be able to process any file without external information or
external compilers.  A small runtime codec is all that should need to be
implemented in each language.  Even if that's not present, data could be
transparently and losslessly converted to and from textual JSON by, e.g.
C utility programs, since most languages already have JSON codecs.
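
How the binary format would carry raw bytes is left open above; on the textual
side, one common workaround (not something this proposal specifies) is to
base64-encode binary values into JSON strings when converting, for example:

    import java.util.Arrays;
    import java.util.Base64;

    // Illustration only: represent raw bytes in textual JSON as base64 inside
    // a string, since JSON itself has no binary type.  The binary file format
    // would carry the bytes as-is.
    public class BinaryAsJson {
      public static void main(String[] args) {
        byte[] raw = { 0x00, (byte) 0xff, 0x10, 0x20 };
        String encoded = Base64.getEncoder().encodeToString(raw);
        System.out.println("{\"payload\":\"" + encoded + "\"}");
        byte[] back = Base64.getDecoder().decode(encoded);  // lossless round trip
        System.out.println(Arrays.equals(back, raw));       // prints: true
      }
    }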

Does this make any sense?

Doug


Re: Multi-language serialization discussion

Posted by Doug Cutting <cu...@apache.org>.
Sanjay Radia wrote:
> I like the self-describing data for the reasons you have stated.
> Q. I assume that in many cases the reader of some serialized data is 
> expecting a particular data-definition (or versions of it). In this case the
> reader has the expected data-definition that was generated from the IDL. 
> If the two data-definitions (the one from the IDL and the other from the 
> serialized data) do not match (modulo versions), then is an exception 
> thrown?

If there are required fields in the expected IDL that are not in the 
data, then, yes, an exception should be thrown.
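
A rough sketch of that check in Java (the schema representation here is made
up for illustration -- just a map from field name to a "required" flag):

    import java.util.Map;
    import java.util.Set;

    // Illustrative only: compare the reader's expected definition against the
    // schema found with the serialized data; a missing required field fails.
    public class SchemaCheck {
      /** expected: field name -> required?;  inData: fields in the data's schema. */
      static void check(Map<String, Boolean> expected, Set<String> inData) {
        for (Map.Entry<String, Boolean> field : expected.entrySet()) {
          if (field.getValue() && !inData.contains(field.getKey())) {
            throw new IllegalStateException(
                "required field missing from data: " + field.getKey());
          }
          // optional fields that are absent are simply skipped
        }
      }
    }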

Doug

Re: Multi-language serialization discussion

Posted by Sanjay Radia <sr...@yahoo-inc.com>.
On Oct 24, 2008, at 2:39 PM, Doug Cutting wrote:

> Bryan Duxbury wrote:
> > I've been reading the discussion about what serialization/RPC  
> project to
> > use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I
> > thought I'd throw in a pro-Thrift vote.
>
> I've been thinking about this, and here's where I've come to:
>
> It's not just RPC.  We need a single, primary object serialization
> system that's used for RPC and for most file-based application data.
>
> Scripting languages are primary users of Hadoop.  We must thus make it
> easy and natural for scripting languages to process data with Hadoop.
>
> Data should be self-describing.  For example, a script should be  
> able to
> read a file without having to first generate code specific to the
> records in that file.  Similarly, a script should be able to write
> records without having to externally define their schema.
>

I like the self-describing data for the reasons you have stated.
Q. I assume that in many cases the reader of some serialized data is  
expecting a particular data-definition (or versions of it). In this case the
reader has the expected data-definition that was generated from the  
IDL. If the two data-definitions (the one from the IDL and the other  
from the serialized data) do not match (modulo versions), then is an  
exception thrown?

sanjay
>
>
> We need an efficient binary file format.  A file of records should not
> repeat the record names with each record.  Rather, the record schema
> used should be stored in the file once.  Programs should be able to  
> read
> the schema and efficiently produce instances from the file.
>
> The schema language should support specification of required and
> optional fields, so that class definitions may evolve.
>
> For some languages (e.g., Java & C) one may wish to generate native
> classes to represent a schema, and to read & write instances.
>
> So, how well does Thrift meet these needs?  Thrift's IDL is a schema
> language, and JSON is a self-describing data format.  But arbitrary  
> JSON
> data is not generally readable by any Thrift-based program.  And
> Thrift's binary formats are not self-describing: they do not include  
> the
> IDL.  Nor does the Thrift runtime in each language permit one to  
> read an
> IDL specification and then use it to efficiently read and write  
> compact,
> self-describing data.
>
> I wonder if we might instead use JSON schemas to describe data.
>
> http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft
>
> We'd implement, in each language, a codec that, given a schema, can
> efficiently read and write instances of that schema.  (JSON schemas  
> are
> JSON data, so any language that supports JSON can already read and  
> write
> a JSON schema.)  The writer could either take a provided schema, or
> automatically induce a schema from the records written.  Schemas would
> be stored in data files, with the data.
>
> JSON's not perfect.  It doesn't (yet) support binary data: that would
> need to be fixed.  But I think Thrift's focus on code-generation makes
> it less friendly to scripting languages, which are primary users of
> Hadoop.  Code generation is possible given a schema, and may be useful
> as an optimization in many cases, but it should be optional, not  
> central.
>
> Folks should be able to process any file without external  
> information or
> external compilers.  A small runtime codec is be all that should be
> implemented in each language.  Even if that's not present, data  
> could be
> transparently and losslessly converted to and from textual JSON by,  
> e.g.
> C utility programs, since most languages already have JSON codecs.
>
> Does this make any sense?
>
> Doug
>


Re: Multi-language serialization discussion

Posted by Jeff Hammerbacher <je...@gmail.com>.
Hey Pete,

Can you write up some documentation on DynamicSerDe for the wiki? It's
come up a few times in discussion and I think it would be of general
use for people.

Thanks,
Jeff

On Mon, Oct 27, 2008 at 12:13 PM, Pete Wyckoff <pw...@facebook.com> wrote:
>
>>   You'd still need to write IDL parsers & processors for each platform.
>
> FYI - Hadoop already has this for Java, in hive/serde/DynamicSerDe. This is exactly that: it gives one the ability to read and write Thrift and non-Thrift data without compilation.
>
> -- pete
>
> On 10/27/08 12:01 PM, "Doug Cutting" <cu...@apache.org> wrote:
>
> Ted Dunning wrote:
>> I don't think that it would be a major inconvenience in any of the major
>> scripting languages to change the meaning of "open" to mean that you must
>> read the IDL for a file, generate a reading script, load that and now be
>> ready to read.  This is a scripting language after all.
>
> That sounds like compilation, which isn't very scripty.  It's certainly
> workable, but not optimal.  We want to push this stack all the way up to
> spreadsheet-type programmers, who define new record types interactively.
>  Do we really want a GUI to run the Thrift compiler each time a file is
> opened, and load new code in?
>
>> Note that you are saying that the writer should have a schema.  This seems
>> to contradict your previous statement and agree with mine.
>
> We can induce a schema.  If an application doesn't specify an output
> schema then the first instance written might implicitly define the
> schema.  Or you could be more lax and modify the schema as instances are
> written to match all instances, then append it at the end of the file.
> So in the binary format there would always be a schema.  It would be
> used for compaction and available to readers to describe the data.
>
>>> So, how well does Thrift meet these needs?
>>
>> Very closely, actually, especially if you adjust it to allow the IDL to be
>> inside the file.
>
> Thrift has a lot of the parts, and one could probably define a Thrift
> protocol that does this.  Looking through the Thrift mail archives, it
> seems that TDenseProtocol with an IDL in the file would get you partway.
>  You'd still need to write IDL parsers & processors for each platform.
>  I'm not sure it would be any less work than to build this from
> scratch, but I guess that's up to me to prove!
>
> On one hand, it's good to have an architecture that embraces more
> different data formats.  But, in practice, it's nice to have actual data
> in fewer formats, since otherwise you end up having to support the cross
> product of formats and platforms.
>
>> We should also consider the JAQL work.
>
> Yes.  I've started to look at that more.  Their examples imply a binary
> format for JSON, but I can find no details.
>
> Doug
>
>
>

Re: Multi-language serialization discussion

Posted by Doug Cutting <cu...@apache.org>.
Pete Wyckoff wrote:
> FYI - Hadoop already has this for Java, in hive/serde/DynamicSerDe. This is exactly that: it gives one the ability to read and write Thrift and non-Thrift data without compilation.

Is this what you mean?

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/

Is there any documentation for this?

Doug

Re: Multi-language serialization discussion

Posted by Pete Wyckoff <pw...@facebook.com>.
>   You'd still need to write IDL parsers & processors for each platform.

FYI - Hadoop already has this for Java, in hive/serde/DynamicSerDe. This is exactly that: it gives one the ability to read and write Thrift and non-Thrift data without compilation.

-- pete

On 10/27/08 12:01 PM, "Doug Cutting" <cu...@apache.org> wrote:

Ted Dunning wrote:
> I don't think that it would be a major inconvenience in any of the major
> scripting languages to change the meaning of "open" to mean that you must
> read the IDL for a file, generate a reading script, load that and now be
> ready to read.  This is a scripting language after all.

That sounds like compilation, which isn't very scripty.  It's certainly
workable, but not optimal.  We want to push this stack all the way up to
spreadsheet-type programmers, who define new record types interactively.
  Do we really want a GUI to run the Thrift compiler each time a file is
opened, and load new code in?

> Note that you are saying that the writer should have a schema.  This seems
> to contradict your previous statement and agree with mine.

We can induce a schema.  If an application doesn't specify an output
schema then the first instance written might implicitly define the
schema.  Or you could be more lax and modify the schema as instances are
written to match all instances, then append it at the end of the file.
So in the binary format there would always be a schema.  It would be
used for compaction and available to readers to describe the data.

>> So, how well does Thrift meet these needs?
>
> Very closely, actually, especially if you adjust it to allow the IDL to be
> inside the file.

Thrift has a lot of the parts, and one could probably define a Thrift
protocol that does this.  Looking through the Thrift mail archives, it
seems that TDenseProtocol with an IDL in the file would get you partway.
  You'd still need to write IDL parsers & processors for each platform.
  I'm not sure it would be any less work than to build this from
scratch, but I guess that's up to me to prove!

On one hand, it's good to have an architecture that embraces more
different data formats.  But, in practice, it's nice to have actual data
in fewer formats, since otherwise you end up having to support the cross
product of formats and platforms.

> We should also consider the JAQL work.

Yes.  I've started to look at that more.  Their examples imply a binary
format for JSON, but I can find no details.

Doug



Re: Multi-language serialization discussion

Posted by Vuk Ercegovac <ve...@us.ibm.com>.
.
.
.
>> We should also consider the JAQL work.
>
> Yes.  I've started to look at that more.  Their examples imply a binary
> format for JSON, but I can find no details.
>
> Doug

The place to start for Jaql's JSON binary is:


http://code.google.com/p/jaql/source/browse/trunk/src/java/com/ibm/jaql/json/type/Item.java

An Item wraps a JSON value (arrays, objects -- called records in the code --
and atoms) for (de)serialization.

The way this comes together with Input/OutputFormats is as follows:
anything that can be read by an InputFormat is considered to be a JSON
array. The default is to assume a SequenceFileInputFormat where Item is the
type for value (consequently any JSON value). There are several ways to
override the default behavior so that other InputFormats can be used and
converted to JSON. More info can be found at
http://code.google.com/p/jaql/wiki/IO

There is currently limited support in Jaql for schema; integrating it more
deeply is one of our top priorities.
We've developed a preliminary schema language:

(http://code.google.com/p/jaql/source/browse/trunk/src/java/com/ibm/jaql/json/schema/Schema.java)
and integrated it with the language for simple validation:

(http://code.google.com/p/jaql/source/browse/trunk/src/java/com/ibm/jaql/lang/expr/core/InstanceOfExpr.java)
Since the use of schema is not deeply integrated, we are certainly open to
other schema languages such as
the current JSON Schema proposal.

As mentioned earlier in the thread, schema would be tremendously useful for
validation as well as storage/runtime efficiency. For Jaql, it can also be
exploited for query optimization. The current plan is to easily support
validation (but not require it) when reading in JSON. Following that, we plan
to look into storage and query optimization opportunities. Deducing schemas
sounds very interesting as well!

Vuk

Re: Multi-language serialization discussion

Posted by Doug Cutting <cu...@apache.org>.
Ted Dunning wrote:
> I don't think that it would be a major inconvenience in any of the major
> scripting languages to change the meaning of "open" to mean that you must
> read the IDL for a file, generate a reading script, load that and now be
> ready to read.  This is a scripting language after all.

That sounds like compilation, which isn't very scripty.  It's certainly 
workable, but not optimal.  We want to push this stack all the way up to 
spreadsheet-type programmers, who define new record types interactively. 
  Do we really want a GUI to run the Thrift compiler each time a file is 
opened, and load new code in?

> Note that you are saying that the writer should have a schema.  This seems
> to contradict your previous statement and agree with mine.

We can induce a schema.  If an application doesn't specify an output 
schema then the first instance written might implicitly define the 
schema.  Or you could be more lax and modify the schema as instances are 
written to match all instances, then append it at the end of the file. 
So in the binary format there would always be a schema.  It would be 
used for compaction and available to readers to describe the data.
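
A sketch of the lax variant in Java, assuming records arrive as maps (the
types below are invented for illustration): the induced schema is widened as
instances are written, and the result could be appended when the file is closed.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of schema induction: each record written widens the induced
    // schema, so the final schema describes every instance in the file.
    public class SchemaInducer {
      // field name -> simple type name, or "any" once instances disagree
      private final Map<String, String> induced =
          new LinkedHashMap<String, String>();

      void observe(Map<String, Object> record) {
        for (Map.Entry<String, Object> e : record.entrySet()) {
          String type = typeOf(e.getValue());
          String previous = induced.get(e.getKey());
          if (previous == null) {
            induced.put(e.getKey(), type);    // first time this field is seen
          } else if (!previous.equals(type)) {
            induced.put(e.getKey(), "any");   // widen on disagreement
          }
        }
      }

      Map<String, String> inducedSchema() { return induced; }

      private static String typeOf(Object v) {
        if (v instanceof Integer || v instanceof Long) return "integer";
        if (v instanceof Number) return "number";
        if (v instanceof Boolean) return "boolean";
        return "string";
      }
    }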

>> So, how well does Thrift meet these needs?
> 
> Very closely, actually, especially if you adjust it to allow the IDL to be
> inside the file.

Thrift has a lot of the parts, and one could probably define a Thrift 
protocol that does this.  Looking through the Thrift mail archives, it 
seems that TDenseProtocol with an IDL in the file would get you partway. 
  You'd still need to write IDL parsers & processors for each platform. 
  I'm not sure it would be any less work than to build this from 
scratch, but I guess that's up to me to prove!

On one hand, it's good to have an architecture that embraces more 
different data formats.  But, in practice, it's nice to have actual data 
in fewer formats, since otherwise you end up having to support the cross 
product of formats and platforms.

> We should also consider the JAQL work.

Yes.  I've started to look at that more.  Their examples imply a binary 
format for JSON, but I can find no details.

Doug

Re: Multi-language serialization discussion

Posted by Martin Traverso <mt...@gmail.com>.
>
>
> Indeed.  And with Java, it might be nice to have the ability to read
> objects as dynamic objects without generating code.
>

FWIW, I'm working on a Thrift-based serializer for Java that does something like
that. It can parse Thrift IDL files and serialize/deserialize from or into
arbitrary Java objects. The serializers & deserializers are generated on the
fly with ASM and do not rely on reflection or map lookups, so their
performance is pretty close to what you get with pre-generated Thrift
classes.

I'm planning to contribute the code to Thrift once it's in reasonable shape.

Martin

Re: Multi-language serialization discussion

Posted by Ted Dunning <te...@gmail.com>.
Taking the last first:


 > Does this make any sense?

Of course!

But,

On Fri, Oct 24, 2008 at 2:39 PM, Doug Cutting <cu...@apache.org> wrote:

> It's not just RPC.  We need a single, primary object serialization system
> that's used for RPC and for most file-based application data.


Yes!


> Scripting languages are primary users of Hadoop.  We must thus make it easy
> and natural for scripting languages to process data with Hadoop.


I think that this deserves some breaking down.

Let's separate scripting users into Pig and everything else.  Pig has fairly
different characteristics from other scripting languages.


> Data should be self-describing.  For example, a script should be able to
> read a file without having to first generate code specific to the records in
> that file.


I think that this may be slightly too strong.

I don't think that it would be a major inconvenience in any of the major
scripting languages to change the meaning of "open" to mean that you must
read the IDL for a file, generate a reading script, load that and now be
ready to read.  This is a scripting language after all.

> Similarly, a script should be able to write records without having to
> externally define their schema.


I am not so convinced of this.  I just spent a few years fighting with a
non-schema design.  I would have LOVED to be able to give the developers a
schema to enforce proper object structure.  When I talked with the Facebook
guys who store logs in Thrift (and thus have a schema), they found my
difficulties unimaginable.

I would vote for a requirement that, at the least, the writer of data declare
what they expect to be writing.


> We need an efficient binary file format.  A file of records should not
> repeat the record names with each record.


Reasonable.


> Rather, the record schema used should be stored in the file once.


In the file or beside it.  It would be fairly trivial to change Thrift to
allow an included IDL at the beginning.

Note that you are saying that the writer should have a schema.  This seems
to contradict your previous statement and agree with mine.



> The schema language should support specification of required and optional
> fields, so that class definitions may evolve.


As does Thrift.


> For some languages (e.g., Java & C) one may wish to generate native classes
> to represent a schema, and to read & write instances.


Indeed.  And with Java, it might be nice to have the ability to read objects
as dynamic objects without generating code.


> So, how well does Thrift meet these needs?


Very closely, actually, especially if you adjust it to allow the IDL to be
inside the file.

> I wonder if we might instead use JSON schemas to describe data.
>
>
> http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft


We should also consider the JAQL work.

> .... But I think Thrift's focus on code-generation makes it less friendly to
> scripting languages, which are primary users of Hadoop.  Code generation is
> possible given a schema, and may be useful as an optimization in many cases,
> but it should be optional, not central.


I think that this is a red herring.  Thrift's current standard practice is
code generation, but in scripting languages it is easy to do this on the fly
at file-open time.  In Java it is easy to read the IDL and use it to build
dynamic objects.


> ... Even if that's not present, data could be transparently and losslessly
> converted to and from textual JSON by, e.g. C utility programs, since most
> languages already have JSON codecs.


This is already quite doable with Thrift, especially if you allow for
on-the-fly code generation.

Re: Multi-language serialization discussion

Posted by Doug Cutting <cu...@apache.org>.
Bryan Duxbury wrote:
> I've been reading the discussion about what serialization/RPC project to 
> use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I 
> thought I'd throw in a pro-Thrift vote.

I've been thinking about this, and here's where I've come to:

It's not just RPC.  We need a single, primary object serialization 
system that's used for RPC and for most file-based application data.

Scripting languages are primary users of Hadoop.  We must thus make it 
easy and natural for scripting languages to process data with Hadoop.

Data should be self-describing.  For example, a script should be able to 
read a file without having to first generate code specific to the 
records in that file.  Similarly, a script should be able to write 
records without having to externally define their schema.

We need an efficient binary file format.  A file of records should not 
repeat the record names with each record.  Rather, the record schema 
used should be stored in the file once.  Programs should be able to read 
the schema and efficiently produce instances from the file.

The schema language should support specification of required and 
optional fields, so that class definitions may evolve.

For some languages (e.g., Java & C) one may wish to generate native 
classes to represent a schema, and to read & write instances.

So, how well does Thrift meet these needs?  Thrift's IDL is a schema 
language, and JSON is a self-describing data format.  But arbitrary JSON 
data is not generally readable by any Thrift-based program.  And 
Thrift's binary formats are not self-describing: they do not include the 
IDL.  Nor does the Thrift runtime in each language permit one to read an 
IDL specification and then use it to efficiently read and write compact, 
self-describing data.

I wonder if we might instead use JSON schemas to describe data.

http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft

We'd implement, in each language, a codec that, given a schema, can 
efficiently read and write instances of that schema.  (JSON schemas are 
JSON data, so any language that supports JSON can already read and write 
a JSON schema.)  The writer could either take a provided schema, or 
automatically induce a schema from the records written.  Schemas would 
be stored in data files, with the data.

JSON's not perfect.  It doesn't (yet) support binary data: that would 
need to be fixed.  But I think Thrift's focus on code-generation makes 
it less friendly to scripting languages, which are primary users of 
Hadoop.  Code generation is possible given a schema, and may be useful 
as an optimization in many cases, but it should be optional, not central.

Folks should be able to process any file without external information or 
external compilers.  A small runtime codec is be all that should be 
implemented in each language.  Even if that's not present, data could be 
transparently and losslessly converted to and from textual JSON by, e.g. 
C utility programs, since most languages already have JSON codecs.

Does this make any sense?

Doug