Posted to common-issues@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/03/25 20:24:27 UTC

[jira] Created: (HADOOP-6660) add Writable wrapper for Avro data

add Writable wrapper for Avro data
----------------------------------

                 Key: HADOOP-6660
                 URL: https://issues.apache.org/jira/browse/HADOOP-6660
             Project: Hadoop Common
          Issue Type: New Feature
          Components: io
            Reporter: Doug Cutting
             Fix For: 0.22.0


To permit passing Avro data through MapReduce we can add a wrapper class that implements Writable.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6660) add wrapper for Avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849907#action_12849907 ] 

Doug Cutting commented on HADOOP-6660:
--------------------------------------

Since the wrapper will no longer implement Writable, and a serialization will be implemented instead, the wrapper class need not hold references to the schema, reader, and writer implementations; these can now reside in the serializer and deserializer.  Thus the two wrapper classes become something like:

{code}
// plain wrapper; serialization is handled by a Serialization implementation, not by Writable
public class AvroValue {
  private Object value;
}
// marker class for keys
public class AvroKey extends AvroValue implements Comparable {}
{code}
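
For illustration, a minimal sketch of what the serializer side might look like once the schema and DatumWriter live there rather than in the wrapper, assuming a recent Avro API; the class name AvroValueSerializer and the datum() accessor on AvroValue are assumptions, not part of this proposal:

{code}
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.serializer.Serializer;

// Hypothetical serializer: the schema and DatumWriter reside here, so the
// AvroValue wrapper stays a plain holder for the datum.
public class AvroValueSerializer implements Serializer<AvroValue> {
  private final GenericDatumWriter<Object> writer;
  private OutputStream out;
  private BinaryEncoder encoder;

  public AvroValueSerializer(Schema schema) {
    this.writer = new GenericDatumWriter<Object>(schema);
  }

  @Override
  public void open(OutputStream out) {
    this.out = out;
    this.encoder = EncoderFactory.get().binaryEncoder(out, null);
  }

  @Override
  public void serialize(AvroValue value) throws IOException {
    writer.write(value.datum(), encoder);  // datum() is an assumed accessor on AvroValue
    encoder.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}

A matching Deserializer would hold the Schema and a DatumReader and construct AvroValue (or AvroKey) instances itself, which is what WritableSerialization could not do for these classes.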



[jira] Updated: (HADOOP-6660) add wrapper for Avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-6660:
---------------------------------

    Description: 
To permit passing Avro data through MapReduce we can add a wrapper class and serialization for it.


  was:
To permit passing Avro data through MapReduce we can add a wrapper class that implements Writable.


        Summary: add wrapper for Avro data  (was: add Writable wrapper for Avro data)

On further thought, the wrapper need not implement Writable.  While WritableSerialization would be able to write AvroValue instances correctly, it could not construct them and hence could not read them correctly.  Instead this should define a new Serialization, AvroValueSerialization, included by default in io.serializations, that accepts instances of AvroValue.
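
For illustration only, a sketch of what such a Serialization might look like against Hadoop's Serialization interface; the configuration property carrying the schema and the schemaFor() helper are assumptions rather than part of this proposal:

{code}
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.Serialization;
import org.apache.hadoop.io.serializer.Serializer;

// Hypothetical Serialization accepting AvroValue (and its AvroKey subclass);
// listing it in the io.serializations property lets SerializationFactory find it.
public class AvroValueSerialization extends Configured
    implements Serialization<AvroValue> {

  @Override
  public boolean accept(Class<?> c) {
    return AvroValue.class.isAssignableFrom(c);
  }

  @Override
  public Serializer<AvroValue> getSerializer(Class<AvroValue> c) {
    return new AvroValueSerializer(schemaFor(c));    // a serializer like the one sketched above
  }

  @Override
  public Deserializer<AvroValue> getDeserializer(Class<AvroValue> c) {
    return new AvroValueDeserializer(schemaFor(c));  // deserializer counterpart, not shown
  }

  // Assumption: the schema is carried in the configuration under a
  // class-specific property; the actual property names are an open question.
  private Schema schemaFor(Class<AvroValue> c) {
    String json = getConf().get("avro.serialization.schema." + c.getName());
    return new Schema.Parser().parse(json);
  }
}
{code}

A job would then list this class in io.serializations alongside WritableSerialization.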



[jira] Commented: (HADOOP-6660) add wrapper for Avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849906#action_12849906 ] 

Doug Cutting commented on HADOOP-6660:
--------------------------------------

We might have static methods to configure a job for use with Avro data, e.g.:

{code}
public static void setMapOutputKeyGeneric(Configuration conf, Schema schema);
public static void setMapOutputKeySpecific(Configuration conf, Schema schema);
public static void setOutputKeyGeneric(Configuration conf, Schema schema);
public static void setOutputKeySpecific(Configuration conf, Schema schema);
public static void setMapOutputValueGeneric(Configuration conf, Schema schema);
public static void setMapOutputValueSpecific(Configuration conf, Schema schema);
public static void setOutputValueGeneric(Configuration conf, Schema schema);
public static void setOutputValueSpecific(Configuration conf, Schema schema);
{code}

These methods would set, in the configuration:
 - the appropriate output class name property (for the map-output or final-output key or value)
 - the appropriate schema property (likewise)
 - the appropriate specific-versus-generic property (likewise)
Thus, with three properties for each of the four outputs, there would be twelve keys in total.

The serialization implementation would then use these to construct an appropriate deserializer.

These methods would not depend on any MapReduce code, and should reside with the serialization implementation, so I propose to add them here.
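
As a rough illustration of the pattern these methods might follow (the class name and property names below are placeholders, not keys this issue defines):

{code}
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;

public class AvroJobConfig {
  // Placeholder property names for the map-output key; the other three
  // outputs (map-output value, final-output key, final-output value) would
  // each have an analogous trio, giving twelve keys in total.
  private static final String MAP_OUT_KEY_CLASS    = "avro.serialization.map.output.key.class";
  private static final String MAP_OUT_KEY_SCHEMA   = "avro.serialization.map.output.key.schema";
  private static final String MAP_OUT_KEY_SPECIFIC = "avro.serialization.map.output.key.specific";

  /** Configure the map-output key as generic Avro data with the given schema. */
  public static void setMapOutputKeyGeneric(Configuration conf, Schema schema) {
    conf.set(MAP_OUT_KEY_CLASS, AvroKey.class.getName());
    conf.set(MAP_OUT_KEY_SCHEMA, schema.toString());
    conf.setBoolean(MAP_OUT_KEY_SPECIFIC, false);
  }

  /** Configure the map-output key as specific Avro data with the given schema. */
  public static void setMapOutputKeySpecific(Configuration conf, Schema schema) {
    conf.set(MAP_OUT_KEY_CLASS, AvroKey.class.getName());
    conf.set(MAP_OUT_KEY_SCHEMA, schema.toString());
    conf.setBoolean(MAP_OUT_KEY_SPECIFIC, true);
  }

  // ...the remaining six methods would follow the same pattern.
}
{code}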




[jira] Resolved: (HADOOP-6660) add wrapper for Avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved HADOOP-6660.
----------------------------------

       Resolution: Duplicate
    Fix Version/s:     (was: 0.22.0)



[jira] Commented: (HADOOP-6660) add Writable wrapper for Avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849879#action_12849879 ] 

Doug Cutting commented on HADOOP-6660:
--------------------------------------

A feature of this proposal is that it can easily be ported to 0.20-based releases.

We would also define a RawComparator for keys and use WritableComparator#define() to register this comparator.

Two Writable classes would be implemented:
{code}
public class AvroValue implements Writable {
  private Object value;        // the wrapped Avro datum
  private Schema schema;       // schema used to read and write the datum
  private DatumWriter writer;  // writes the datum in write(DataOutput)
  private DatumReader reader;  // reads the datum in readFields(DataInput)
  ...
}
// marker class for keys
public class AvroKey extends AvroValue implements Comparable {}
{code}
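
A minimal sketch of the raw comparator mentioned above, assuming a reasonably recent Avro release, that AvroKey is declared as a WritableComparable (it is already both Writable and Comparable), and that the key schema is available, e.g. from the job configuration, at registration time:

{code}
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryData;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical raw comparator: orders keys by comparing their serialized
// bytes with Avro's BinaryData.compare(), so keys are never deserialized
// during the sort.
public class AvroKeyComparator extends WritableComparator {
  private final Schema schema;

  public AvroKeyComparator(Schema schema) {
    super(AvroKey.class);  // assumes AvroKey is a WritableComparable
    this.schema = schema;
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return BinaryData.compare(b1, s1, l1, b2, s2, l2, schema);
  }

  // Registration as described above, once the key schema is known.
  public static void register(Schema keySchema) {
    WritableComparator.define(AvroKey.class, new AvroKeyComparator(keySchema));
  }
}
{code}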

Instances would be constructed with a factory, e.g.:

{code}
public class AvroValueFactory {
  // static factory methods select generic or specific Avro data for a schema
  public static AvroValueFactory makeGeneric(Schema schema);
  public static AvroValueFactory makeSpecific(Schema schema);

  public AvroValue value(Object datum);
  public AvroKey key(Object datum);
}
{code}

A SerializationFactory would be defined for these classes.  MapReduce passes the serialization factory a class name and a configuration.  One of four configuration properties would be used to identify the schema, based on whether the class name is AvroKey or AvroValue and whether the configuration specifies mapred.task.is.map or not, thus permitting jobs to specify distinct Avro schemas for map and reduce output keys and values.
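
A rough sketch of that schema lookup, where mapred.task.is.map is an existing MapReduce property but the class and the four schema property names are placeholders:

{code}
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: picks one of four schema properties based on whether
// the class is AvroKey or AvroValue and whether this task is a map or a reduce.
public class AvroSchemas {
  public static Schema getSchema(Configuration conf, Class<?> c) {
    boolean isMap = conf.getBoolean("mapred.task.is.map", false);
    boolean isKey = AvroKey.class.isAssignableFrom(c);
    // Placeholder names: avro.schema.{map,reduce}.{key,value}
    String property = "avro.schema."
        + (isMap ? "map." : "reduce.")
        + (isKey ? "key" : "value");
    return new Schema.Parser().parse(conf.get(property));
  }
}
{code}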




[jira] Commented: (HADOOP-6660) add Writable wrapper for Avro data

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849876#action_12849876 ] 

Zheng Shao commented on HADOOP-6660:
------------------------------------

How does the Writable class get the metadata (like type information)?
Are we going to serialize the metadata along with the data for each key (or value)? That would be inefficient.


