Posted to user@hadoop.apache.org by java8964 java8964 <ja...@hotmail.com> on 2012/09/20 21:01:39 UTC

why hadoop does not provide a round robin partitioner

Hi,
During my development of ETLs on the Hadoop platform, one question came up:
why doesn't Hadoop provide a round-robin partitioner?

From my experience, it is a very powerful option when the job has only a
small, limited set of distinct key values, and it balances the ETL
resources. Here is what I mean:

1) Sometimes you will have an ETL with a small number of keys, for example
data partitioned by date or by hour. So in every ETL load I will have a very
limited count of unique key values (maybe 10 if I load 10 days of data, or
24 if I load one day of data and use the hour as the key).
2) The HashPartitioner is good when you have a large number of distinct
keys, because then the partition numbers it generates are spread out fairly
randomly.
3) A lot of times I have enough spare reducers, but because the hashCode()
method happens to map several keys to the same partition, all the data for
those keys goes to the same reducer process. That is not very efficient,
since some spare reducers just happen to get nothing to do.
4) Of course I can implement my own partitioner to control this, but I
wonder whether it would be too hard to provide a general round-robin
partitioner that distributes the different keys equally over the available
reducers (see the sketch after this list). Of course, as the count of
distinct keys grows, the performance of such a partitioner would degrade
badly. But if we know the count of distinct keys is small enough, this kind
of partitioner would be a good option, right?
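
Just to make point 4 concrete, here is a rough sketch of the kind of
partitioner I have in mind. The class name and the etl.expected.keys
property are only for illustration; it assumes Text keys and that the
distinct keys are known before the job is submitted:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Sketch of a "round robin over keys" partitioner: the list of expected
 * distinct keys is supplied up front, and the key at position i is sent to
 * reducer i % numPartitions. Keys that were not declared fall back to
 * hash-based partitioning.
 */
public class KnownKeyRoundRobinPartitioner extends Partitioner<Text, Text>
    implements Configurable {

  private Configuration conf;
  private List<String> expectedKeys;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // "etl.expected.keys" is a made-up property name for this sketch,
    // e.g. "2012-09-11,2012-09-12,...,2012-09-20" for a 10-day load.
    expectedKeys = Arrays.asList(conf.get("etl.expected.keys", "").split(","));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    int index = expectedKeys.indexOf(key.toString());
    if (index < 0) {
      // Unknown key: fall back to the usual hash-based behaviour.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    return index % numPartitions;
  }
}

The driver would then do something like
conf.set("etl.expected.keys", "2012-09-19,2012-09-20") and
job.setPartitionerClass(KnownKeyRoundRobinPartitioner.class) before
submitting the job.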
Thanks
Yong

Re: why hadoop does not provide a round robin partitioner

Posted by Ted Dunning <td...@maprtech.com>.
The simplest solution for the situation as stated is to use an identity
hash function.  Of course, you can't split things any finer than the number
of keys with this approach.
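
For the hour case that could be as small as the sketch below (class name
invented for the example, assuming the hour travels as an IntWritable key):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Identity-hash partitioner sketch: the key's own value (the hour, 0-23)
 * is used directly, so consecutive hours land on consecutive reducers
 * instead of wherever a general-purpose hashCode() happens to put them.
 */
public class IdentityHourPartitioner extends Partitioner<IntWritable, Text> {
  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Hours are 0-23, so the value is already a usable partition index.
    return key.get() % numPartitions;
  }
}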

If you can process different time periods independently, you may be able to
add a small number of bits to your key to get lots of bins which will then
be split relatively evenly.  If you can do this, however, you probably can
use a combiner and get better results.
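
Concretely, the mapper could add a couple of salt bits to the key, as in the
sketch below (names invented for the example, and it assumes each record
starts with a yyyy-MM-dd date); a combiner or second aggregation pass then
folds the salt back out:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of "adding a few bits to the key": each date key is split into
 * 4 sub-keys (date#0 .. date#3), so one hot date spreads over up to 4
 * reducers. The salt is removed again when the results are re-aggregated.
 */
public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final int SALT_BITS = 2;           // 2 bits -> 4 bins per key
  private final Random random = new Random();
  private final Text saltedKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String date = line.toString().substring(0, 10); // assumes a leading date
    int salt = random.nextInt(1 << SALT_BITS);
    saltedKey.set(date + "#" + salt);
    context.write(saltedKey, line);
  }
}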

On Thu, Sep 20, 2012 at 3:21 PM, Bertrand Dechoux <de...@gmail.com> wrote:

> If I am understanding correctly, you are saying that, given what you know
> about your data, the provided hash function does not distribute it
> uniformly enough. The answer to that is to implement a better hash
> function. You could build one generically if you can provide the
> partitioner with stats about its inputs, but that would not be within
> Hadoop's scope. You should look at Hive/Pig or something equivalent.
>

Re: why hadoop does not provide a round robin partitioner

Posted by Bertrand Dechoux <de...@gmail.com>.
I am not sure what you mean.

I assume that by round robin you mean that the first key value goes to the
first reducer, the second to the second, and so on, modulo the number of
reducers. I don't think you have access to the rank of the values. You
could keep state inside your partitioner, but I don't think you have any
guarantee that the same instance of your partitioner will always be used.
And anyway, if map1 emits key1 and key3 while map2 emits key1, key2 and
key3, how would you ensure that all information about the same key ends up
at the same reducer?
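
To illustrate, a stateful partitioner would have to look something like the
sketch below (class name invented), and because the counter depends on call
order within one task rather than on the key, two map tasks would route the
same key to different partitions:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Anti-example: a stateful "round robin" partitioner. The counter is local
 * to one partitioner instance in one map task, so the same key can get a
 * different partition in different tasks (or even within one task if the
 * key repeats), which violates the rule that the partition must be a pure
 * function of the key.
 */
public class BrokenRoundRobinPartitioner extends Partitioner<Text, Text> {
  private int nextPartition = 0;

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    int partition = nextPartition % numPartitions;
    nextPartition++;            // depends on call order, not on the key
    return partition;
  }
}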

If I am understanding correctly, you are saying that, given what you know
about your data, the provided hash function does not distribute it
uniformly enough. The answer to that is to implement a better hash
function. You could build one generically if you can provide the
partitioner with stats about its inputs, but that would not be within
Hadoop's scope. You should look at Hive/Pig or something equivalent.
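
As a sketch of what I mean by stats (all names below, including the
partition.key.counts property, are invented for the example): if a prior
pass can give the partitioner approximate record counts per key, it can
deterministically pack the keys onto the least-loaded reducers:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Sketch of a stats-driven partitioner: keys are assigned to reducers by a
 * deterministic greedy bin-packing over pre-computed record counts, so every
 * map task derives the same key -> partition mapping.
 */
public class StatsAwarePartitioner extends Partitioner<Text, Text>
    implements Configurable {

  private Configuration conf;
  private Map<String, Integer> keyToPartition;

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (keyToPartition == null) {
      buildMapping(numPartitions);
    }
    Integer p = keyToPartition.get(key.toString());
    // Unknown key: fall back to the usual hash-based behaviour.
    return p != null ? p : (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  private void buildMapping(int numPartitions) {
    // "partition.key.counts" is an invented property for this sketch,
    // e.g. "2012-09-19=500000,2012-09-20=120000".
    String stats = conf.get("partition.key.counts", "");
    long[] load = new long[numPartitions];
    keyToPartition = new HashMap<String, Integer>();
    for (String entry : stats.split(",")) {
      if (entry.isEmpty()) {
        continue;
      }
      String[] parts = entry.split("=");
      int lightest = 0;                       // least-loaded reducer so far
      for (int i = 1; i < numPartitions; i++) {
        if (load[i] < load[lightest]) {
          lightest = i;
        }
      }
      load[lightest] += Long.parseLong(parts[1]);
      keyToPartition.put(parts[0], lightest);
    }
  }
}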

Regards

Bertrand

On Thu, Sep 20, 2012 at 9:01 PM, java8964 java8964 <ja...@hotmail.com> wrote:

>  Hi,
>
> During my development of ETLs on the Hadoop platform, one question came
> up: why doesn't Hadoop provide a round-robin partitioner?
>
> From my experience, it is a very powerful option when the job has only a
> small, limited set of distinct key values, and it balances the ETL
> resources. Here is what I mean:
>
> 1) Sometimes you will have an ETL with a small number of keys, for
> example data partitioned by date or by hour. So in every ETL load I will
> have a very limited count of unique key values (maybe 10 if I load 10
> days of data, or 24 if I load one day of data and use the hour as the
> key).
> 2) The HashPartitioner is good when you have a large number of distinct
> keys, because then the partition numbers it generates are spread out
> fairly randomly.
> 3) A lot of times I have enough spare reducers, but because the
> hashCode() method happens to map several keys to the same partition, all
> the data for those keys goes to the same reducer process. That is not
> very efficient, since some spare reducers just happen to get nothing to
> do.
> 4) Of course I can implement my own partitioner to control this, but I
> wonder whether it would be too hard to provide a general round-robin
> partitioner that distributes the different keys equally over the
> available reducers. Of course, as the count of distinct keys grows, the
> performance of such a partitioner would degrade badly. But if we know the
> count of distinct keys is small enough, this kind of partitioner would be
> a good option, right?
>
> Thanks
>
> Yong
>



-- 
Bertrand Dechoux
