Posted to common-user@hadoop.apache.org by Mithila Nagendra <mn...@asu.edu> on 2010/08/25 18:40:25 UTC

Custom partitioner for hadoop

I came across the tutorial on creating a custom partitioner on Hadoop
(http://philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/).
I am trying to create my own partitioner on Hadoop, and the above blog
has given me a good starting point.

I had a question on the partitioner. In the code given in the blog they
have:

if( nbOccurences < 3 )
       return 0;
else
       return 1;

I want to do something similar, but I need the key to be in a range, like
the following:

if (nbOccurences > lbrange0 && nbOccurences < ubrange0)
       return 0;
if (nbOccurences > lbrange1 && nbOccurences < ubrange1)
       return 1;

The range boundaries lbrange0, lbrange1, ubrange0, ubrange1 are calculated
by reading a histogram that is stored on HDFS. I initially thought I could
read the histogram from the customPartitioner class and calculate the range
boundaries there, but in that case the ranges get recalculated for every
<K,V> pair emitted by the mapper. In order to avoid this I was thinking of
passing the range boundaries to the partitioner. How would I do that? Is
there an alternative? Any suggestion would prove useful.

Thank you,

Mithila
Ph.D. Candidate, C.S., Arizona State University

Re: Custom partitioner for hadoop

Posted by Mithila Nagendra <mn...@asu.edu>.
Thanks so much. I took care of it in the mapper.

On Thu, Aug 26, 2010 at 3:26 PM, David Rosenstrauch <da...@darose.net> wrote:

> On 08/26/2010 05:47 PM, Mithila Nagendra wrote:
>
>> Thank you so much for a response. I had one last question.
>>
>> What if I don't want a particular <K, V> pair to be put into a partition?
>> For example, if K=5, then I want the partitioner to skip this Key. How
>> would I do this? I tried to return -1 when I don't want a key to go to any
>> partition, but that causes an "illegal partition" error. How would I do
>> this?
>>
>> Thanks!
>> Mithila
>>
>
> You wouldn't do that in the partitioner.  Filtering out specific
> records/keys should occur in the mapper or reducer.  (And in general
> conceptually, the mapper would probably be the more appropriate place,
> though it depends on your specific application.)
>
> DR
>

Re: Custom partitioner for hadoop

Posted by David Rosenstrauch <da...@darose.net>.
On 08/26/2010 05:47 PM, Mithila Nagendra wrote:
> Thank you so much for a response. I had one last question.
>
> What if I don't want a particular <K, V> pair to be put into a partition?
> For example, if K=5, then I want the partitioner to skip this Key. How would
> I do this? I tried to return -1 when I don't want a key to go to any
> partition, but that causes an "illegal partition" error. How would I do
> this?
>
> Thanks!
> Mithila

You wouldn't do that in the partitioner.  Filtering out specific 
records/keys should occur in the mapper or reducer.  (And in general 
conceptually, the mapper would probably be the more appropriate place, 
though it depends on your specific application.)
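
A minimal sketch of what filtering in the mapper could look like. It is plain Java so it runs without a cluster: the class name FilteringMapSketch, the SKIPPED_KEY constant, and the list standing in for the mapper's output are all assumptions made for illustration. In a real Mapper, output.add(...) would be a call to context.write(...).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Mapper; not part of Hadoop.
final class FilteringMapSketch {
    // The key to drop (K=5 in the question above).
    static final int SKIPPED_KEY = 5;

    // Plays the role of a map() body. 'output' stands in for the
    // mapper's context, and output.add(...) for context.write(...).
    static void map(int key, String value, List<Map.Entry<Integer, String>> output) {
        if (key == SKIPPED_KEY) {
            // Nothing is emitted for this key, so the partitioner is
            // never asked to place it and never has to return -1.
            return;
        }
        output.add(new SimpleEntry<>(key, value));
    }
}
```

The partitioner only sees records that the mapper actually emits, so a record dropped here needs no partition number at all.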

DR

Re: Custom partitioner for hadoop

Posted by Mithila Nagendra <mn...@asu.edu>.
Thank you so much for a response. I had one last question.

What if I don't want a particular <K, V> pair to be put into a partition?
For example, if K=5, then I want the partitioner to skip this Key. How would
I do this? I tried to return -1 when I don't want a key to go to any
partition, but that causes an "illegal partition" error. How would I do
this?

Thanks!
Mithila

On Wed, Aug 25, 2010 at 1:38 PM, David Rosenstrauch <da...@darose.net> wrote:

> If you define a Hadoop object as implementing Configurable, then its
> setConf() method will be called once, right after it gets instantiated.  So
> each partitioner that gets instantiated will have its setConf() method
> called right afterwards.
>
> I'm taking advantage of that fact by calling my own (private) "configure()"
> method when the Partitioner gets its configuration.  So in that configure
> method, you would grab the ranges from out of the configuration object.
>
> The flip side of this is that your ranges won't just magically appear in
> the configuration object.  You'll have to set them on the configuration
> object used in the Job that you're submitting.
>
> A copy of the job's config object will then get passed to each task in your
> job, which you can then use to configure that task.
>
> HTH,
>
> DR
>
>
> On 08/25/2010 04:23 PM, Mithila Nagendra wrote:
>
>> In which of the three functions would I have to set the ranges? In the
>> configure function? Would the configure be called once for every mapper?
>> Thank you!
>>
>> On Wed, Aug 25, 2010 at 12:50 PM, David Rosenstrauch <darose@darose.net> wrote:
>>
>>> On 08/25/2010 12:40 PM, Mithila Nagendra wrote:
>>>
>>>> In order to avoid this I was thinking of
>>>> passing the range boundaries to the partitioner. How would I do that? Is
>>>> there an alternative? Any suggestion would prove useful.
>>>>
>>>
>>> We use a custom partitioner, for which we pass in configuration data that
>>> gets used in the partitioning calculations.
>>>
>>> We do it by making the Partitioner implement Configurable, and then grab
>>> the needed config data from the configuration object that we're given.
>>> (We set the needed config data on the config object when we submit the
>>> job).
>>>
>>

Re: Custom partitioner for hadoop

Posted by David Rosenstrauch <da...@darose.net>.
If you define a Hadoop object as implementing Configurable, then its
setConf() method will be called once, right after it gets instantiated.
So each partitioner that gets instantiated will have its setConf()
method called right afterwards.

I'm taking advantage of that fact by calling my own (private) 
"configure()" method when the Partitioner gets its configuration.  So in 
that configure method, you would grab the ranges from out of the 
configuration object.

The flip side of this is that your ranges won't just magically appear in 
the configuration object.  You'll have to set them on the configuration 
object used in the Job that you're submitting.

A copy of the job's config object will then get passed to each task in 
your job, which you can then use to configure that task.
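
A rough sketch of the "parse once, look up per record" pattern described above. It is plain Java so it can be run on its own: a Map<String, String> stands in for Hadoop's Configuration, and the class name RangePartitionerSketch, the key "sketch.partitioner.upper.bounds", and the comma-separated encoding are all assumptions, not anything Hadoop defines. It also assumes contiguous ranges with ascending upper bounds, so every key lands in some partition, which differs slightly from the strict lower/upper checks in the original question.

```java
import java.util.Map;

// Hypothetical stand-in for a Partitioner that implements Configurable.
final class RangePartitionerSketch {
    // Assumed configuration key; any unused name would do.
    static final String BOUNDS_KEY = "sketch.partitioner.upper.bounds";

    // Exclusive upper bound of each range, in ascending order.
    private int[] upperBounds;

    // Plays the role of setConf(Configuration): runs once per
    // partitioner instance, so the boundaries are parsed only once.
    void setConf(Map<String, String> conf) {
        String csv = conf.get(BOUNDS_KEY);
        if (csv == null) {
            throw new IllegalStateException("missing " + BOUNDS_KEY);
        }
        String[] parts = csv.split(",");
        upperBounds = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            upperBounds[i] = Integer.parseInt(parts[i].trim());
        }
    }

    // Plays the role of getPartition(): runs for every record, but only
    // compares against the boundaries cached by setConf().
    int getPartition(int nbOccurrences, int numPartitions) {
        int partition = 0;
        while (partition < upperBounds.length
                && nbOccurrences >= upperBounds[partition]) {
            partition++;
        }
        // Clamp so the result always lies in [0, numPartitions).
        return Math.min(partition, numPartitions - 1);
    }
}
```

With bounds "10,20" and three partitions, values below 10 go to partition 0, values from 10 to 19 go to partition 1, and everything else goes to partition 2.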

HTH,

DR

On 08/25/2010 04:23 PM, Mithila Nagendra wrote:
> In which of the three functions would I have to set the ranges? In the
> configure function? Would the configure be called once for every mapper?
> Thank you!
>
> On Wed, Aug 25, 2010 at 12:50 PM, David Rosenstrauch <da...@darose.net> wrote:
>
>> On 08/25/2010 12:40 PM, Mithila Nagendra wrote:
>>
>>> In order to avoid this I was thinking of
>>> passing the range boundaries to the partitioner. How would I do that? Is
>>> there an alternative? Any suggestion would prove useful.
>>>
>>
>> We use a custom partitioner, for which we pass in configuration data that
>> gets used in the partitioning calculations.
>>
>> We do it by making the Partitioner implement Configurable, and then grab
>> the needed config data from the configuration object that we're given. (We
>> set the needed config data on the config object when we submit the job).

Re: Custom partitioner for hadoop

Posted by Mithila Nagendra <mn...@asu.edu>.
In which of the three functions would I have to set the ranges? In the
configure function? Would the configure be called once for every mapper?
Thank you!

On Wed, Aug 25, 2010 at 12:50 PM, David Rosenstrauch <da...@darose.net> wrote:

> On 08/25/2010 12:40 PM, Mithila Nagendra wrote:
>
>> In order to avoid this I was thinking of
>> passing the range boundaries to the partitioner. How would I do that? Is
>> there an alternative? Any suggestion would prove useful.
>>
>
> We use a custom partitioner, for which we pass in configuration data that
> gets used in the partitioning calculations.
>
> We do it by making the Partitioner implement Configurable, and then grab
> the needed config data from the configuration object that we're given. (We
> set the needed config data on the config object when we submit the job).
>  i.e., like so:
>
> import org.apache.hadoop.conf.Configurable;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.mapreduce.Partitioner;
>
> public class OurPartitioner extends Partitioner<BytesWritable, Writable>
>                implements Configurable {
> ...
>
>        // Called for each key/value pair emitted by the mapper.
>        public int getPartition(BytesWritable key, Writable value,
>                        int numPartitions) {
> ...
>        }
>
>        public Configuration getConf() {
>                return conf;
>        }
>
>        // Called once, right after the partitioner is instantiated.
>        public void setConf(Configuration conf) {
>                this.conf = conf;
>
>                configure();
>        }
>
>        // Reads the partitioning parameters out of the job configuration.
>        // Declares no checked exceptions, because setConf() declares none.
>        @SuppressWarnings("unchecked")
>        private void configure() {
>                String <parmValue> = conf.get(<parmKey>);
>                if (<parmValue> == null) {
>                        throw new RuntimeException(.....);
>                }
>        }
>
>        private Configuration conf;
> }
>
> HTH,
>
> DR
>

Re: Custom partitioner for hadoop

Posted by David Rosenstrauch <da...@darose.net>.
On 08/25/2010 12:40 PM, Mithila Nagendra wrote:
> In order to avoid this I was thinking of
> passing the range boundaries to the partitioner. How would I do that? Is
> there an alternative? Any suggestion would prove useful.

We use a custom partitioner, for which we pass in configuration data 
that gets used in the partitioning calculations.

We do it by making the Partitioner implement Configurable, and then grab
the needed config data from the configuration object that we're given.
(We set the needed config data on the config object when we submit the
job.)  For example:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class OurPartitioner extends Partitioner<BytesWritable, Writable>
		implements Configurable {
...

	// Called for each key/value pair emitted by the mapper.
	public int getPartition(BytesWritable key, Writable value,
			int numPartitions) {
...
	}

	public Configuration getConf() {
		return conf;
	}

	// Called once, right after the partitioner is instantiated.
	public void setConf(Configuration conf) {
		this.conf = conf;

		configure();
	}

	// Reads the partitioning parameters out of the job configuration.
	// Declares no checked exceptions, because setConf() declares none.
	@SuppressWarnings("unchecked")
	private void configure() {
		String <parmValue> = conf.get(<parmKey>);
		if (<parmValue> == null) {
			throw new RuntimeException(.....);
		}
	}

	private Configuration conf;
}
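
For the other half, setting the data on the config object at submit time, one possible approach is to flatten the boundaries into a single string in the driver. The sketch below is plain Java; the class name BoundaryEncoder, the key name, and the comma-separated format are assumptions chosen for illustration. In a driver, the resulting string would be stored with something like job.getConfiguration().set(BoundaryEncoder.BOUNDS_KEY, encoded) before the job is submitted, and read back with conf.get(...) in the partitioner.

```java
import java.util.StringJoiner;

// Hypothetical driver-side helper; not part of Hadoop.
final class BoundaryEncoder {
    // Assumed configuration key; must match whatever the partitioner reads.
    static final String BOUNDS_KEY = "sketch.partitioner.upper.bounds";

    // Turns range boundaries (computed once in the driver, for example
    // from the histogram) into a string that fits in a config value.
    static String encode(int[] upperBounds) {
        StringJoiner joined = new StringJoiner(",");
        for (int i = 0; i < upperBounds.length; i++) {
            // Reject unordered input early, before any task reads it.
            if (i > 0 && upperBounds[i] <= upperBounds[i - 1]) {
                throw new IllegalArgumentException(
                        "bounds must be strictly ascending");
            }
            joined.add(Integer.toString(upperBounds[i]));
        }
        return joined.toString();
    }
}
```

Doing the histogram read and the encoding in the driver means each task only has to parse a short string, rather than reopening the histogram file.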

HTH,

DR