Posted to common-dev@hadoop.apache.org by Jay Kreps <bo...@gmail.com> on 2008/09/01 22:52:50 UTC

Serialization with additional schema info

Hi All,

I am interested in hooking up a custom serialization layer I use to the new
pluggable Hadoop serialization framework. It appears that the framework
assumes a one-to-one mapping between Java classes and serializations. This
is exactly what we want to get away from: having a common data format
allows us to easily write generic data aggregation jobs that work with any
type. This is how a database supports generic operations such as joins,
group-bys, etc.--because the data format is always a set of tuples, the
data can be manipulated generically, without understanding the details of
its interpretation, which is impossible with user-defined complex types the
database can't operate on. To do this I need to store data in a standard
format using a fixed set of supported types, keep a short string schema
description along with each file, and pass that description to a generic
serializer/deserializer to tell it how to read the bytes in the file. The
problem I have is that there is no way to get this additional schema
information into the serializer to tell it how to serialize and
deserialize.
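
For reference, the interface in org.apache.hadoop.io.serializer is keyed
only on the class; it looks roughly like this:

  public interface Serialization<T> {
    boolean accept(Class<?> c);
    Serializer<T> getSerializer(Class<T> c);
    Deserializer<T> getDeserializer(Class<T> c);
  }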

Some Details in case the general problem is too vague:

A very nice generic data format that maps well to programming languages is
JSON. For example a user could be stored like this: {"name":"Jay",
"date-o-birth":"05-25-1980", "age":28, "is_active": true, ...}. But since
the same field names are stored with every "row", this is highly
inefficient. It makes more sense to store just the bytes needed for the
values, and to store the expected fields and their types separately. This
also lets us store numbers compactly.
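
For example, a rough sketch of writing one row compactly (the method and
its signature are made up for illustration):

  import java.io.DataOutput;
  import java.io.IOException;

  // Write only the values, in schema order; the field names and types are
  // stored once, separately, rather than repeated with every row.
  void writeRow(DataOutput out, String name, String dob, int age,
                boolean isActive) throws IOException {
    out.writeUTF(name);         // "name"
    out.writeUTF(dob);          // "date-o-birth", kept as a string here
    out.writeInt(age);          // "age": always 4 bytes, not ASCII digits
    out.writeBoolean(isActive); // "is_active": 1 byte
  }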

JSON supports numbers, strings, lists, and maps, which all have natural
mappings in Java. The above user example would translate to a java Map
containing the given keys and values.
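
Concretely, the in-memory representation for the example above would be
something like:

  Map<String, Object> user = new HashMap<String, Object>();
  user.put("name", "Jay");
  user.put("date-o-birth", "05-25-1980");
  user.put("age", 28);
  user.put("is_active", true);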

Here is where the trouble starts. I can't do this in the existing
SerializationFactory because the type for the object is just Map.class,
and that doesn't contain enough information to properly deserialize the
data. In reality I need a string describing the type, such as
  {"name":"string", "date-o-birth":"date", "age":"int32",
   "is_active":"boolean", ...}
Note that this string contains all the information needed to add back the
property names and to correctly interpret the bytes as Integer, Boolean,
or whatever.
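
To make that concrete, here is a rough sketch of a generic reader driven
by such a schema (the Field class and the schema parsing are hypothetical,
just for illustration):

  import java.io.DataInput;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class Field {
    String name;  // e.g. "age"
    String type;  // e.g. "int32"
  }

  Map<String, Object> readRow(DataInput in, List<Field> schema)
      throws IOException {
    Map<String, Object> row = new HashMap<String, Object>();
    for (Field f : schema) {
      Object value;
      if ("string".equals(f.type)) {
        value = in.readUTF();
      } else if ("int32".equals(f.type)) {
        value = in.readInt();
      } else if ("boolean".equals(f.type)) {
        value = in.readBoolean();
      } else {
        throw new IOException("unsupported type: " + f.type);
      }
      row.put(f.name, value);  // names come from the schema, not the file
    }
    return row;
  }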

The obvious solution is to just add this schema to the JobConf as a
property such as "map.key.schema.info", and use it to construct the right
serializer in the Serialization implementation. The problem with this is
that there is no way for the Serialization implementation to know whether
it is constructing the serializer for the map key, map value, reduce key,
or reduce value.
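
That is, something like this (using the hypothetical property name above):

  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf();
  conf.set("map.key.schema.info",
      "{\"name\":\"string\", \"age\":\"int32\", \"is_active\":\"boolean\"}");
  // But inside Serialization.getSerializer(Class c) there is no way to
  // tell whether the call is for the map key, map value, reduce key, or
  // reduce value, so one property can't serve all four.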

Some possible solutions:

For now I am just wrapping my map and reduce functions to do the
serialization/deserialization themselves, which solves my immediate
problem. However, this seems like a common case--the serialization needs
information not present in the class itself--and I would like to add
support to do it right. Would you guys accept a patch that did one of the
following?

1. Make SerializationFactory have getMapKeySerializer,
getMapValueSerializer, etc. methods, and allow the user to specify their
own SerializationFactory by setting a property with the appropriate class
name. This is probably the most flexible option and doesn't break any
existing user serialization implementations. The getMapKeySerializer
method can then check map.key.schema.info in addition to
mapred.mapinput.key.class.
2. Change Serialization.getSerializer(Class c) to
Serialization.getSerializer(Class c, SerializerType k), where
SerializerType = enum {MapKey, MapValue, ReduceKey, ReduceValue} (see the
sketch after this list). This allows the serialization implementer to
invent their own properties (map.key.schema or whatever) and fetch the
appropriate thing.
3. Add mapred.mapinput.serializer.info, mapred.reduceinput.serializer.info,
etc. and pass the value of this into the constructor of the serializer if it
has a constructor with a single string argument.
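
To make option 2 concrete, the change might look roughly like this (the
names are just a sketch, not a worked-out proposal):

  public enum SerializerType { MapKey, MapValue, ReduceKey, ReduceValue }

  public interface Serialization<T> {
    boolean accept(Class<?> c);
    // An implementation could e.g. map MapKey to a "map.key.schema"
    // property and pull the schema from the job configuration.
    Serializer<T> getSerializer(Class<T> c, SerializerType k);
    Deserializer<T> getDeserializer(Class<T> c, SerializerType k);
  }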

Or maybe there is a better way to accomplish this?

Thanks!

-Jay

Re: Serialization with additional schema info

Posted by Pete Wyckoff <pw...@facebook.com>.
I think what you may be looking for is something like Hive's DynamicSerDe.
It takes a DDL (in Thrift format) and can then serialize/deserialize data
in any format Thrift supports (binary, JSON, ...), as well as
control-separator-delimited text. It can write the data in
Thrift-compatible mode or in compact mode.

It supports all the types Thrift supports, including int, double, string,
binary, map, list, set, and struct--so it supports nested types.


-- pete



Re: Serialization with additional schema info

Posted by Tom White <to...@gmail.com>.
Jay,

The Serialization and MapReduce APIs are very class-based - so having
fixed types with dynamic serialization capabilities doesn't work as
well in the current design.

I like 2 better than 1, but both make the Serialization API dependent
on MapReduce, which it currently isn't. And arguably it shouldn't be
as you could use it simply to do serialization of data, outside a
MapReduce context. Perhaps SerializerType is just a String, which also
makes things more flexible (at the expense of type-safety)?

What would the API changes look like for 3?

Also, I believe the Jaql team has been looking at how to write JSON
serializers, so perhaps there is an opportunity for collaboration
here?

Tom
