Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/09/10 03:55:44 UTC

[jira] Created: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
--------------------------------------------------------------------------------------------------

                 Key: HADOOP-4143
                 URL: https://issues.apache.org/jira/browse/HADOOP-4143
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Chris Douglas


For some partitioners (particularly those using comparators to classify keys), it would be helpful if one could specify a "raw" partitioner that would receive the serialized version of the key rather than the object emitted from the map.
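For illustration, a minimal sketch of what such a raw partitioner might look like, assuming a hypothetical byte-oriented {{getPartition}} signature (none of this is in the current API):

{code}
// Hypothetical sketch: partition on the serialized key bytes directly,
// without deserializing the record object. The hash scheme mirrors
// WritableComparator.hashBytes.
public class RawHashPartitioner {
  public int getPartition(byte[] keyBytes, int offset, int length,
                          int numPartitions) {
    int hash = 1;
    for (int i = offset; i < offset + length; ++i) {
      hash = (31 * hash) + keyBytes[i];
    }
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}
{code}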



[jira] Commented: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629895#action_12629895 ] 

Doug Cutting commented on HADOOP-4143:
--------------------------------------

> Unfortunately, Partitioner is an interface [ ... ] 

This is a non-issue with HADOOP-1230.  If this feature is needed soon, then we should push harder on that one.

That said, I'm still not clear on the motivation.  Is it performance?  Comparators already provide both raw and cooked comparisons.  If a partitioner is defined in terms of a comparator, it must currently use the cooked comparison, which might be slower.  If this is a performance issue, then we should measure the potential improvement with a benchmark before we consider the API change.  Are there non-performance reasons for this change?
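For reference, the existing comparator contract already pairs the two forms; {{org.apache.hadoop.io.RawComparator}} looks like:

{code}
// The raw (serialized-bytes) comparison alongside the cooked (object)
// comparison inherited from java.util.Comparator.
public interface RawComparator<T> extends java.util.Comparator<T> {
  int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
  // int compare(T o1, T o2) is inherited -- the "cooked" form
}
{code}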



[jira] Commented: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629908#action_12629908 ] 

Chris Douglas commented on HADOOP-4143:
---------------------------------------

The performance argument is pretty much limited to "memcmp" types like Text and BytesWritable. Since the partitioner is called from collect, when we still have the cooked records, the only motivation would be supporting partitioners like the one used in the terasort example. I talked offline with Owen about this, and he makes the case that a "MemComparable" interface on the aforementioned types would probably be more than sufficient for practical uses, more readable than having the partitioner handle different/layered length encodings, and a more general abstraction than this proposal.

The only remaining reason would be the aforementioned space/time tradeoff: saving an int per record while adding a call to the partitioner for each compare in the sort. Any improvement in running time would probably be noise at best, and likely inferior to simply tuning the job configuration.

I don't usually like "tagging" types, but the MemComparable interface would not only cover every case this proposal would, it could also help with RawComparator implementations, table stores, etc. This issue was conceived as a way to avoid such an interface, but it's clearly not an improvement on it and should probably be closed as "Won't fix".
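As a sketch of why such an interface keeps partitioner code simple (hypothetical {{MemComparable}} type with the {{getBytes()}}/{{getLength()}} accessors discussed above; nothing here exists yet):

{code}
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical: one hash partitioner covers every key type that exposes
// its serialized form, with no per-type byte extraction.
public class BinaryHashPartitioner<V> implements Partitioner<MemComparable, V> {
  public void configure(JobConf job) { }
  public int getPartition(MemComparable key, V value, int numPartitions) {
    // Hash only the valid prefix of the backing array.
    int hash = WritableComparator.hashBytes(key.getBytes(), key.getLength());
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}
{code}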



[jira] Updated: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4143:
----------------------------------

    Attachment: 4143-0.patch

Attaching a quick hack of the preceding proposal.



[jira] Commented: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629896#action_12629896 ] 

Chris Douglas commented on HADOOP-4143:
---------------------------------------

Of course, this would also permit jobs that use raw partitioners to use only 12 bytes of per-record accounting instead of 16 (dropping the stored 4-byte partition value), at additional cost to the compare during the sort (probably not worthwhile).



[jira] Issue Comment Edited: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629908#action_12629908 ] 

chris.douglas edited comment on HADOOP-4143 at 9/10/08 11:48 AM:
-----------------------------------------------------------------

The performance argument is pretty much limited to "memcmp" types like Text and BytesWritable. Since the partitioner is called from collect, when we still have the cooked records, the only motivation would be supporting partitioners like the one used in the terasort example. I talked offline with Owen about this, and he makes the case that a "MemComparable" interface on the aforementioned types would probably be more than sufficient for practical uses, more readable than having the partitioner handle different/layered length encodings, and a more general abstraction than this proposal.

The only remaining reason would be the aforementioned space/time tradeoff: saving an int per record while adding a call to the partitioner for each compare in the sort. Any improvement in running time would probably be noise at best, and likely inferior to simply tuning the job configuration.

I don't usually like "tagging" types, but the MemComparable interface would not only cover every case this proposal would, it could also help with RawComparator implementations, table stores, etc. This issue was conceived as a way to avoid such an interface, but it's clearly not an improvement on it and should probably be closed.



[jira] Resolved: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HADOOP-4143.
-----------------------------------

    Resolution: Won't Fix

I don't think the gain is worth the additional interface. I would propose adding a BinaryComparable interface that has:

{code}
interface BinaryComparable {
  int getLength();   // number of valid bytes in the serialized form
  byte[] getBytes(); // backing bytes; only the first getLength() are valid
}
{code}

and both Text and BytesWritable should implement it.
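A companion sketch (not part of any patch here): a single lexicographic comparator written against the proposed interface would then serve both types:

{code}
import java.util.Comparator;
import org.apache.hadoop.io.WritableComparator;

// Sketch: compares any two BinaryComparable values byte-by-byte,
// reusing the existing unsigned-byte comparison utility.
public class BinaryComparator implements Comparator<BinaryComparable> {
  public int compare(BinaryComparable a, BinaryComparable b) {
    return WritableComparator.compareBytes(a.getBytes(), 0, a.getLength(),
                                           b.getBytes(), 0, b.getLength());
  }
}
{code}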



[jira] Commented: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677 ] 

Chris Douglas commented on HADOOP-4143:
---------------------------------------

There are several advantages to this.
# Partitioners working with binary data (or capable of it) wouldn't need to define ways of extracting bytes from particular key types
# Where defined, RawComparators can be used without deserializing the key
# There should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or emitted as bytes from the map can use a partitioner that understands the underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following proposal)

And, of course, the disadvantages:
# Results differ if the key contains partition-determining state that is not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal that I'm not sure how to resolve

Ideally, a partitioner could handle either raw or object data depending on its configuration. Unfortunately, {{Partitioner}} is an interface, so adding a {{boolean isRaw()}} method would break a *lot* of user code. That said, the interface can be safely deprecated and replaced with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  /** Partition on the deserialized key/value objects. */
  public abstract int getPartition(K key, V value, int numPartitions);
  /** Whether this partitioner expects the serialized key. */
  public boolean isRaw() { return false; }
  /** Partition on the serialized key; called only when isRaw() is true. */
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}

A wrapper would be trivial to write (it could be a final class, etc.), providing backwards compatibility for a few releases. To support records larger than the serialization buffer, there are at least two options:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value. For a sane set of partitioners, serializers, and value types, this should be compatible with the existing semantics.

Modifications to {{MapTask}} would be minimal: a final boolean can determine which overloaded {{getPartition}} is called, plus the backwards-compatible wrapping of the deprecated Partitioner interface.

Thoughts? This doesn't solve a general problem, but it's useful for partitioning relative to a set of keys and particularly helpful in keeping the API to binary partitioners sane.
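For concreteness, a sketch of the backwards-compatibility wrapper mentioned above (hypothetical class name; assumes the abstract {{Partitioner}} proposed in this comment):

{code}
// Hypothetical: adapts a partitioner written against the deprecated
// interface to the proposed abstract class. isRaw() inherits the default
// false, so MapTask would keep calling the object-based getPartition.
public final class DeprecatedPartitionerWrapper<K,V> extends Partitioner<K,V> {
  private final org.apache.hadoop.mapred.Partitioner<K,V> delegate;
  public DeprecatedPartitionerWrapper(
      org.apache.hadoop.mapred.Partitioner<K,V> delegate) {
    this.delegate = delegate;
  }
  public int getPartition(K key, V value, int numPartitions) {
    return delegate.getPartition(key, value, numPartitions);
  }
}
{code}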



[jira] Issue Comment Edited: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677 ] 

chris.douglas edited comment on HADOOP-4143 at 9/9/08 6:58 PM:
---------------------------------------------------------------

There are several advantages to this.
# Partitioners working with binary data (or capable of it) wouldn't need to define ways of extracting bytes from particular key types
# Where defined, RawComparators can be used without deserializing the key
# There should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or emitted as bytes from the map can use a partitioner that understands the underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following proposal)

And, of course, the disadvantages:
# Results differ if the key contains partition-determining state that is not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal that I'm not sure how to resolve

Ideally, a partitioner could handle either raw or object data depending on its configuration. Unfortunately, {{Partitioner}} is an interface, so adding a {{boolean isRaw()}} method would break a *lot* of user code. That said, the interface can be safely deprecated and replaced with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  /** Partition on the deserialized key/value objects. */
  public abstract int getPartition(K key, V value, int numPartitions);
  /** Whether this partitioner expects the serialized key. */
  public boolean isRaw() { return false; }
  /** Partition on the serialized key; called only when isRaw() is true. */
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}

A wrapper would be trivial to write (it could be a final class, etc.), providing backwards compatibility for a few releases. To support records larger than the serialization buffer, there are at least two options:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value. For a sane set of partitioners, serializers, and value types, this should be compatible with the existing semantics.

Modifications to {{MapTask}} would be minimal: a final boolean can determine which overloaded {{getPartition}} is called, plus the backwards-compatible wrapping of the deprecated Partitioner interface.

Thoughts? This doesn't solve a general problem, but it's useful for partitioning relative to a set of keys and particularly helpful in keeping the API to binary partitioners sane.
