Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/09/10 03:59:44 UTC
[jira] Issue Comment Edited: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
[ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677 ]
chris.douglas edited comment on HADOOP-4143 at 9/9/08 6:58 PM:
---------------------------------------------------------------
There are several advantages to this.
# Partitioners (capable of) working with binary data wouldn't need to define ways of extracting bytes from particular key types
# Where defined, RawComparators can be used without serializing the key
# Should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or emitted as bytes from the map can use a partitioner that understands the underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following proposal)
And, of course, the disadvantages:
# Results may differ if the key's partition depends on state that is not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal I'm not sure how to resolve
Ideally, a partitioner could handle either raw or object data depending on its configuration. Unfortunately, {{Partitioner}} is an interface, so adding a {{boolean isRaw()}} method would break a *lot* of user code. That said, the interface can be safely deprecated and replaced with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}
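As an illustration only (not part of the proposal), a subclass could opt into the raw path by overriding {{isRaw}} and the byte-based overload. The class name and hash function below are hypothetical, and a minimal stand-in for the proposed abstract class is repeated so the sketch stands alone:

```java
// Stand-in for the proposed abstract class, repeated so this sketch compiles
// on its own; see the proposal above for the real shape.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical raw partitioner: hashes the serialized key bytes directly,
// never deserializing the key. Name and hash are illustrative.
class RawHashPartitioner<K, V> extends Partitioner<K, V> {

  @Override
  public boolean isRaw() { return true; }

  // Object path: used only if the framework chooses not to take the raw path.
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Raw path: a simple polynomial hash over the serialized bytes.
  @Override
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    int h = 1;
    for (int i = offset; i < offset + length; i++) {
      h = 31 * h + keyBytes[i];
    }
    return (h & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Note the two overloads must agree for jobs that mix object and raw paths; here they deliberately do not, which is exactly the first disadvantage listed above.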
A wrapper would be trivial to write (it could be a final class, etc.), providing backwards compatibility for a few releases. In support of records larger than the serialization buffer, there are at least two solutions:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value. For a sane set of partitioners, serializers, and value types, this should be compatible with the existing semantics.
Modifications to {{MapTask}} would be minimal. A final boolean can determine which overloaded {{getPartition}} is called, and the deprecated Partitioner interface would be wrapped for backwards compatibility.
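A sketch of both pieces, purely illustrative (none of these names are actual Hadoop code, and the stand-in classes exist only to make the sketch self-contained): the backwards-compatible adapter, and the dispatch a collect path could do on a final boolean fixed at task setup:

```java
// Stand-in for the proposed abstract class (illustrative).
abstract class NewPartitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Stand-in for the deprecated Partitioner interface (illustrative).
interface OldPartitioner<K, V> {
  int getPartition(K key, V value, int numPartitions);
}

// Backwards-compatible wrapper: final, trivial, never raw.
final class OldPartitionerAdapter<K, V> extends NewPartitioner<K, V> {
  private final OldPartitioner<K, V> wrapped;
  OldPartitionerAdapter(OldPartitioner<K, V> wrapped) { this.wrapped = wrapped; }
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return wrapped.getPartition(key, value, numPartitions);
  }
}

// Dispatch as a MapTask-like collect path might do it: decide once at setup,
// then branch per record on the cached final boolean.
class CollectSketch<K, V> {
  private final NewPartitioner<K, V> partitioner;
  private final boolean raw;  // fixed for the life of the task

  CollectSketch(NewPartitioner<K, V> p) {
    this.partitioner = p;
    this.raw = p.isRaw();
  }

  int partitionFor(K key, V value, byte[] keyBytes, int keyLen, int parts) {
    return raw
        ? partitioner.getPartition(keyBytes, 0, keyLen, parts)
        : partitioner.getPartition(key, value, parts);
  }
}
```

The per-record cost for non-raw partitioners is a single branch on a final field, which the JIT should fold away, consistent with the "no performance hit" point above.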
Thoughts? This doesn't solve a general problem, but it's useful for partitioning relative to a set of keys and particularly helpful in keeping the API to binary partitioners sane.
> Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4143
> URL: https://issues.apache.org/jira/browse/HADOOP-4143
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Chris Douglas
>
> For some partitioners (particularly those using comparators to classify keys), it would be helpful if one could specify a "raw" partitioner that would receive the serialized version of the key rather than the object emitted from the map.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.