Posted to common-dev@hadoop.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2007/10/02 23:01:52 UTC

[jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531912 ] 

Tom White commented on HADOOP-1986:
-----------------------------------

Here's an initial plan:

1. Remove the requirement for Map Reduce types to extend WritableComparable/Writable. So, for example, Mapper would become:

{code}
public interface Mapper<K1, V1, K2, V2>
{code}

2. Create a serialization class that can turn objects into byte streams and vice versa. Something like:

{code}
public interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
}
{code}

3. Add a configuration property to specify the Serializer to use. This would default to WritableSerializer (an implementation of Serializer<Writable>).
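Step 3 amounts to a reflective class lookup keyed off a configuration property. A self-contained sketch of the idea follows; the property name "io.serializer.class" and the DummySerializer stand-in are placeholders for whatever key and WritableSerializer default the actual patch would use, and plain java.util.Properties stands in for JobConf:

```java
import java.util.Properties;

public class ConfigSketch {
  // Stand-in for WritableSerializer, just to make the lookup concrete.
  public static class DummySerializer {}

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    // Hypothetical property name; when unset, fall back to the default
    // serializer class, as step 3 proposes.
    String cls = conf.getProperty("io.serializer.class",
                                  DummySerializer.class.getName());
    // Look the class up and instantiate it reflectively, as the framework
    // does for other pluggable components.
    Object serializer = Class.forName(cls).getDeclaredConstructor().newInstance();
    System.out.println(serializer.getClass().getName());
  }
}
```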

4. Change the type of the output key comparator to be Comparator<T>, with default WritableComparator (an implementation of Comparator<Writable>).
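The point of step 4 is that once the WritableComparable bound is gone, the sort can take any ordinary java.util.Comparator<T>. A toy illustration (String stands in for an arbitrary, e.g. Thrift-generated, key type; this is not the real WritableComparator):

```java
import java.util.*;

public class ComparatorSketch {
  public static void main(String[] args) {
    // Map output keys of an arbitrary type, sorted with a plain
    // Comparator<T> rather than via WritableComparable.
    List<String> keys = new ArrayList<>(Arrays.asList("b", "a", "c"));
    Comparator<String> keyComparator = Comparator.naturalOrder();
    keys.sort(keyComparator);
    System.out.println(keys); // [a, b, c]
  }
}
```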

5. In MapTask use a Serializer to write map outputs to the output files.

6. In ReduceTask use a Serializer to read sorted map outputs for the reduce phase.
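Taken together, steps 2, 5 and 6 suggest a round trip like the following. This is only an illustrative, self-contained sketch: a toy mutable Text type and TextSerializer stand in for Writable and WritableSerializer, and plain java.io byte streams stand in for the real map output files:

```java
import java.io.*;

public class SerializerSketch {
  // The Serializer proposed in step 2.
  interface Serializer<T> {
    void serialize(T t, OutputStream out) throws IOException;
    void deserialize(T t, InputStream in) throws IOException;
  }

  // deserialize(T, InputStream) fills in an existing object, so the
  // toy key type must be mutable, as Writables are.
  static class Text {
    String value = "";
  }

  static class TextSerializer implements Serializer<Text> {
    public void serialize(Text t, OutputStream out) throws IOException {
      // Note: a new wrapper object per call, which is the problem
      // discussed below.
      DataOutputStream d = new DataOutputStream(out);
      d.writeUTF(t.value);
      d.flush();
    }
    public void deserialize(Text t, InputStream in) throws IOException {
      t.value = new DataInputStream(in).readUTF();
    }
  }

  public static void main(String[] args) throws IOException {
    // MapTask-style write of a key to a byte stream (step 5)...
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    TextSerializer s = new TextSerializer();
    Text key = new Text();
    key.value = "apple";
    s.serialize(key, buf);
    // ...and a ReduceTask-style read of it back (step 6).
    Text read = new Text();
    s.deserialize(read, new ByteArrayInputStream(buf.toByteArray()));
    System.out.println(read.value); // apple
  }
}
```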

I've played with some of this and it looks like it would work. However, there is a problem with the Serializer interface above as it stands. The serialize method of WritableSerializer looks like this:

{code}
public void serialize(Writable w, OutputStream out) throws IOException {
  w.write(new DataOutputStream(out));
}
{code}

Clearly it is not acceptable to create a new object on every write. This is a general problem: Writables write to a DataOutputStream, Thrift objects write to a TProtocol, and so on. So the solution is probably to make the Serializer stateful, giving it the same lifetime as the wrapped stream. Something like:

{code}
public interface Serializer<T> {
  void open(OutputStream out);
  void serialize(T t);
  void close();
}
{code}
(There would be a similar Deserializer interface.)
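To make the stateful proposal concrete, here is a self-contained sketch of what the matching Deserializer and one implementation pair might look like. The toy Text type stands in for a Writable, and the throws clauses are an assumption the real interfaces would likely need; the key point is that the stream wrapper is created once in open(), not on every serialize call:

```java
import java.io.*;

public class StatefulSerializerSketch {
  interface Serializer<T> {
    void open(OutputStream out) throws IOException;
    void serialize(T t) throws IOException;
    void close() throws IOException;
  }

  // The "similar Deserializer interface" mentioned above.
  interface Deserializer<T> {
    void open(InputStream in) throws IOException;
    void deserialize(T t) throws IOException; // fills in an existing object
    void close() throws IOException;
  }

  static class Text {
    String value = "";
  }

  // Stateful: the DataOutputStream is created once, in open(), and
  // shares the lifetime of the wrapped stream.
  static class TextSerializer implements Serializer<Text> {
    private DataOutputStream out;
    public void open(OutputStream o) { out = new DataOutputStream(o); }
    public void serialize(Text t) throws IOException { out.writeUTF(t.value); }
    public void close() throws IOException { out.close(); }
  }

  static class TextDeserializer implements Deserializer<Text> {
    private DataInputStream in;
    public void open(InputStream i) { in = new DataInputStream(i); }
    public void deserialize(Text t) throws IOException { t.value = in.readUTF(); }
    public void close() throws IOException { in.close(); }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    Serializer<Text> s = new TextSerializer();
    s.open(buf);
    Text t = new Text();
    t.value = "one";
    s.serialize(t);
    t.value = "two";
    s.serialize(t); // second write reuses the same wrapper: no new object
    s.close();

    Deserializer<Text> d = new TextDeserializer();
    d.open(new ByteArrayInputStream(buf.toByteArray()));
    Text r = new Text();
    d.deserialize(r);
    System.out.println(r.value); // one
    d.deserialize(r);
    System.out.println(r.value); // two
    d.close();
  }
}
```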

Could this work?

> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>             Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs. While it's possible to write Writable wrappers for other serialization frameworks (such as Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types directly, without explicit wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.