Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/09/10 03:59:44 UTC
[jira] Issue Comment Edited: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
[ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677 ]
chris.douglas edited comment on HADOOP-4143 at 9/9/08 6:58 PM:
---------------------------------------------------------------
There are several advantages to this.
# Partitioners (capable of) working with binary data wouldn't need to define ways of extracting bytes from particular key types
# Where defined, RawComparators can be used without serializing the key
# Should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or emitted as bytes from the map can use a partitioner that understands the underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following proposal)
And, of course, the disadvantages:
# Results may differ if the key's partition depends on state that is not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal I'm not sure how to resolve
Ideally, a partitioner could handle either raw or object data depending on its configuration. Unfortunately, {{Partitioner}} is an interface, so adding a {{boolean isRaw()}} method would break a *lot* of user code. That said, the interface can be safely deprecated and replaced with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}
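As an illustration only (not part of the proposal), a subclass could opt into the raw path by overriding {{isRaw}} and the byte-based overload. The class name and hash function below are hypothetical, and a minimal stand-in for the proposed abstract class is repeated so the sketch stands alone:

```java
// Stand-in for the proposed abstract class, repeated so this sketch compiles
// on its own; see the proposal above for the real shape.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical raw partitioner: hashes the serialized key bytes directly,
// never deserializing the key. Name and hash are illustrative.
class RawHashPartitioner<K, V> extends Partitioner<K, V> {

  @Override
  public boolean isRaw() { return true; }

  // Object path: used only if the framework chooses not to take the raw path.
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Raw path: a simple polynomial hash over the serialized bytes.
  @Override
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    int h = 1;
    for (int i = offset; i < offset + length; i++) {
      h = 31 * h + keyBytes[i];
    }
    return (h & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Note the two overloads must agree for jobs that mix object and raw paths; here they deliberately do not, which is exactly the first disadvantage listed above.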
A wrapper would be trivial to write (it could be a final class, etc.), providing backwards compatibility for a few releases. In support of records larger than the serialization buffer, there are at least two solutions:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value. For a sane set of partitioners, serializers, and value types, this should be compatible with the existing semantics.
Modifications to {{MapTask}} would be minimal. A final boolean can determine which overloaded {{getPartition}} is called, and the deprecated Partitioner interface would be wrapped for backwards compatibility.
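A sketch of both pieces, purely illustrative (none of these names are actual Hadoop code, and the stand-in classes exist only to make the sketch self-contained): the backwards-compatible adapter, and the dispatch a collect path could do on a final boolean fixed at task setup:

```java
// Stand-in for the proposed abstract class (illustrative).
abstract class NewPartitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Stand-in for the deprecated Partitioner interface (illustrative).
interface OldPartitioner<K, V> {
  int getPartition(K key, V value, int numPartitions);
}

// Backwards-compatible wrapper: final, trivial, never raw.
final class OldPartitionerAdapter<K, V> extends NewPartitioner<K, V> {
  private final OldPartitioner<K, V> wrapped;
  OldPartitionerAdapter(OldPartitioner<K, V> wrapped) { this.wrapped = wrapped; }
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return wrapped.getPartition(key, value, numPartitions);
  }
}

// Dispatch as a MapTask-like collect path might do it: decide once at setup,
// then branch per record on the cached final boolean.
class CollectSketch<K, V> {
  private final NewPartitioner<K, V> partitioner;
  private final boolean raw;  // fixed for the life of the task

  CollectSketch(NewPartitioner<K, V> p) {
    this.partitioner = p;
    this.raw = p.isRaw();
  }

  int partitionFor(K key, V value, byte[] keyBytes, int keyLen, int parts) {
    return raw
        ? partitioner.getPartition(keyBytes, 0, keyLen, parts)
        : partitioner.getPartition(key, value, parts);
  }
}
```

The per-record cost for non-raw partitioners is a single branch on a final field, which the JIT should fold away, consistent with the "no performance hit" point above.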
Thoughts? This doesn't solve a general problem, but it's useful for partitioning relative to a set of keys and particularly helpful in keeping the API to binary partitioners sane.
> Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4143
> URL: https://issues.apache.org/jira/browse/HADOOP-4143
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Chris Douglas
>
> For some partitioners (particularly those using comparators to classify keys), it would be helpful if one could specify a "raw" partitioner that would receive the serialized version of the key rather than the object emitted from the map.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.