Posted to common-dev@hadoop.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2007/10/02 23:01:52 UTC
[jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
[ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531912 ]
Tom White commented on HADOOP-1986:
-----------------------------------
Here's an initial plan:
1. Remove the requirement for Map Reduce types to extend WritableComparable/Writable. So for example Mapper would become:
{code}
public interface Mapper<K1, V1, K2, V2>
{code}
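To make step 1 concrete, here is a hedged sketch of what a mapper could look like once the Writable bound is gone. The simplified Mapper and OutputCollector signatures below are illustrative stand-ins, not the actual Hadoop interfaces:

```java
// Simplified stand-ins for the Map Reduce interfaces (illustrative only,
// not the real Hadoop signatures):
interface OutputCollector<K, V> {
    void collect(K key, V value);
}

interface Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output);
}

// With the WritableComparable/Writable requirement removed, a word-count
// mapper can emit plain Java types directly, with no Text/IntWritable
// wrapping and unwrapping.
class WordCountMapper implements Mapper<Long, String, String, Integer> {
    public void map(Long key, String line, OutputCollector<String, Integer> output) {
        for (String word : line.split("\\s+")) {
            output.collect(word, 1);
        }
    }
}
```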
2. Create a serialization class that can turn objects into byte streams and vice versa. Something like:
{code}
public interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
}
{code}
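As a toy illustration of this contract, here is a serializer for mutable text buffers. TextSerializer is purely hypothetical, chosen only because StringBuilder lets deserialize fill an existing object, Writable-style, instead of allocating a new one:

```java
import java.io.*;

// The proposed interface, repeated so this example is self-contained.
interface Serializer<T> {
    void serialize(T t, OutputStream out) throws IOException;
    void deserialize(T t, InputStream in) throws IOException;
}

// Illustrative only: serializes a mutable text buffer. Note that
// deserialize populates the caller's existing object rather than
// returning a new one, matching the Writable readFields pattern.
class TextSerializer implements Serializer<StringBuilder> {
    public void serialize(StringBuilder t, OutputStream out) throws IOException {
        new DataOutputStream(out).writeUTF(t.toString());
    }
    public void deserialize(StringBuilder t, InputStream in) throws IOException {
        t.setLength(0);
        t.append(new DataInputStream(in).readUTF());
    }
}
```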
3. Add a configuration property to specify the Serializer to use. This would default to WritableSerializer (an implementation of Serializer<Writable>).
4. Change the type of the output key comparator to be Comparator<T>, with default WritableComparator (an implementation of Comparator<Writable>).
5. In MapTask use a Serializer to write map outputs to the output files.
6. In ReduceTask use a Serializer to read sorted map outputs for the reduce phase.
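The configuration lookup in step 3 could be wired up roughly as follows. This is a sketch only: the property name "mapred.serializer.class", the factory class, and the NoOpSerializer default are all assumptions standing in for WritableSerializer and whatever key is eventually chosen:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

// The interface from step 2, repeated so the sketch compiles on its own.
interface Serializer<T> {
    void serialize(T t, OutputStream out) throws IOException;
}

// Hypothetical default implementation, standing in for WritableSerializer.
class NoOpSerializer implements Serializer<Object> {
    public void serialize(Object t, OutputStream out) { }
}

// Sketch of the step 3 lookup: read the Serializer class name from a
// configuration property and instantiate it reflectively. The property
// name "mapred.serializer.class" is an assumption, not a real key.
class SerializerFactory {
    static Serializer<?> get(Properties conf) throws Exception {
        String className = conf.getProperty("mapred.serializer.class",
                                            NoOpSerializer.class.getName());
        return (Serializer<?>) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }
}
```

MapTask and ReduceTask (steps 5 and 6) would then ask such a factory for the configured Serializer instead of assuming Writable.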
I've played with some of this and it looks like it would work. However, there is a problem with the Serializer interface above as it stands: the serialize method of WritableSerializer looks like this:
{code}
public void serialize(Writable w, OutputStream out) throws IOException {
  w.write(new DataOutputStream(out));
}
{code}
Clearly it is not acceptable to create a new object on every write. This is a general problem - Writables write to a DataOutputStream, Thrift objects write to a TProtocol, etc. So the solution is probably to make the Serializer stateful, having the same lifetime as the wrapped stream. Something like:
{code}
public interface Serializer<T> {
  void open(OutputStream out);
  void serialize(T t);
  void close();
}
{code}
(There would be a similar Deserializer interface.)
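With the stateful interface, a WritableSerializer along these lines would create the DataOutputStream once per stream in open() rather than on every call. The minimal Writable stand-in below (and the added throws clauses) are there only to keep the sketch self-contained and compilable:

```java
import java.io.*;

// The stateful interface, with IOException declared so implementations
// that touch streams can compile.
interface Serializer<T> {
    void open(OutputStream out) throws IOException;
    void serialize(T t) throws IOException;
    void close() throws IOException;
}

// Minimal stand-in for Hadoop's Writable, for self-containment.
interface Writable {
    void write(DataOutput out) throws IOException;
}

// The serializer now has the same lifetime as the wrapped stream: the
// DataOutputStream is created once in open(), not on every serialize().
class WritableSerializer implements Serializer<Writable> {
    private DataOutputStream dataOut;

    public void open(OutputStream out) {
        dataOut = out instanceof DataOutputStream
                ? (DataOutputStream) out
                : new DataOutputStream(out);
    }

    public void serialize(Writable w) throws IOException {
        w.write(dataOut);
    }

    public void close() throws IOException {
        dataOut.close();
    }
}
```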
Could this work?
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs. While it's possible to write Writable wrappers for other serialization frameworks (such as Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types directly, without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.