Posted to user@hbase.apache.org by Jean-Marc Spaggiari <je...@spaggiari.org> on 2013/04/10 14:49:26 UTC

MapReduce: Reducers partitions.

Hi,

Quick question: how is the data from the map tasks partitioned for
the reducers?

If there is 1 reducer, it's easy, but if there are more, are all the
same keys guaranteed to end up on the same reducer? Or not necessarily? If
they are not, how can we provide a partitioning function?

Thanks,

JM

Re: MapReduce: Reducers partitions.

Posted by Stack <st...@duboce.net>.
On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Nitin,
>
> You got my question correctly.
>
> However, I'm wondering how it's working when it's done into HBase.



We use the default MapReduce partitioner:
http://hadoop.apache.org/docs/r2.0.3-alpha/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html



> Do we have defaults partionners ...



No.



> ...so we have the same garantee that records
> mapping to one key go to the same reducer.



This will happen with the default partitioner (the key is hashed; the hash is
always the same, so the key always goes to the same reducer).
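
For reference, the logic of that default partitioner boils down to one line.
A minimal sketch mirroring what HashPartitioner does (the class name and type
parameters here are only illustrative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Minimal sketch of the default behaviour: the partition is derived from
    // the key's hash, so equal keys always land on the same reducer.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then bucket by
        // the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }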



> Or do we have to implement
> this one our own.
>

No.

St.Ack

Re: MapReduce: Reducers partitions.

Posted by Graeme Wallace <gr...@farecompare.com>.
Ok. Thanks.


On Wed, Apr 10, 2013 at 2:01 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Greame,
>
> No. The reducer will simply write on the table the same way you are doing a
> regular Put. If a split is required because of the size, then the region
> will be split, but at the end, there will not necessary be any region
> split.
>
> In the usecase described below, all the 600 lines will "simply" go into the
> only region in the table and no split will occur.
>
> The goal is to partition the data for the reducer only. Not in the table.
>
> JM
>
> 2013/4/10 Graeme Wallace <gr...@farecompare.com>
>
> > Whats the behavior then if you return hash % num_reducers and you have no
> > splits defined. When the reducer writes to the table does the region
> server
> > local to the reducer create a new region ?
> >
> > Graeme
> >
> >
> > On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > So.
> > >
> > > I looked at the code, and I have one comment/suggestion here.
> > >
> > > If the table we are outputing to has regions, then partitions are build
> > > around that, and that's fine. But if the table is totally empty with a
> > > single region, even if we setNumReduceTasks to 2 or more, all the keys
> > will
> > > go on the same first reducer because of this:
> > >     if (this.startKeys.length == 1){
> > >       return 0;
> > >     }
> > > I think it will be better to return something like keycrc%numPartitions
> > > instead. That still allow the application to spread the reducing
> process
> > > over multinode(racks) even if there is only one region in the table.
> > >
> > > In my usecase, I have millions of lines producing some statistics. At
> the
> > > end, I will have only about 600 lines, but it will take a lot of map
> and
> > > reduce time to go from millions to 600, that's why I'm looking to have
> > more
> > > than one reducer. However, with only 600 lines, it's very difficult to
> > > pre-split the table. Keys are all very close.
> > >
> > > Does anyone see anything wrong with changing this default behaviour
> when
> > > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> > >
> > > JM
> > >
> > > 2013/4/10 Jean-Marc Spaggiari <je...@spaggiari.org>
> > >
> > > > Thanks Ted.
> > > >
> > > > It's exactly where I was looking at now. I was close. I will take a
> > > deeper
> > > > look.
> > > >
> > > > Thanks Nitin for the link. I will read that too.
> > > >
> > > > JM
> > > >
> > > > 2013/4/10 Nitin Pawar <ni...@gmail.com>
> > > >
> > > >> To add what Ted said,
> > > >>
> > > >> the same discussion happened on the question Jean asked
> > > >>
> > > >> https://issues.apache.org/jira/browse/HBASE-1287
> > > >>
> > > >>
> > > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > >>
> > > >> > Jean-Marc:
> > > >> > Take a look at HRegionPartitioner which is in both mapred and
> > > mapreduce
> > > >> > packages:
> > > >> >
> > > >> >  * This is used to partition the output keys into groups of keys.
> > > >> >
> > > >> >  * Keys are grouped according to the regions that currently exist
> > > >> >
> > > >> >  * so that each reducer fills a single region so load is
> > distributed.
> > > >> >
> > > >> > Cheers
> > > >> >
> > > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > > >> > jean-marc@spaggiari.org> wrote:
> > > >> >
> > > >> > > Hi Nitin,
> > > >> > >
> > > >> > > You got my question correctly.
> > > >> > >
> > > >> > > However, I'm wondering how it's working when it's done into
> HBase.
> > > Do
> > > >> > > we have defaults partionners so we have the same garantee that
> > > records
> > > >> > > mapping to one key go to the same reducer. Or do we have to
> > > implement
> > > >> > > this one our own.
> > > >> > >
> > > >> > > JM
> > > >> > >
> > > >> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> > > >> > > > I hope i understood what you are asking is this . If not then
> > > >> pardon me
> > > >> > > :)
> > > >> > > > from the hadoop developer handbook few lines
> > > >> > > >
> > > >> > > > The*Partitioner* class determines which partition a given
> (key,
> > > >> value)
> > > >> > > pair
> > > >> > > > will go to. The default partitioner computes a hash value for
> > the
> > > >> key
> > > >> > and
> > > >> > > > assigns the partition based on this result. It garantees that
> > all
> > > >> the
> > > >> > > > records mapping to one key go to same reducer
> > > >> > > >
> > > >> > > > You can write your custom partitioner as well
> > > >> > > > here is the link :
> > > >> > > >
> > > >>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > > >> > > > jean-marc@spaggiari.org> wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> quick question. How are the data from the map tasks
> > partitionned
> > > >> for
> > > >> > > >> the reducers?
> > > >> > > >>
> > > >> > > >> If there is 1 reducer, it's easy, but if there is more, are
> all
> > > >> they
> > > >> > > >> same keys garanteed to end on the same reducer? Or not
> > necessary?
> > > >>  If
> > > >> > > >> they are not, how can we provide a partionning function?
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >>
> > > >> > > >> JM
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Nitin Pawar
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Nitin Pawar
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Graeme Wallace
> > CTO
> > FareCompare.com
> > O: 972 588 1414
> > M: 214 681 9018
> >
>



-- 
Graeme Wallace
CTO
FareCompare.com
O: 972 588 1414
M: 214 681 9018

Re: MapReduce: Reducers partitions.

Posted by Ted Yu <yu...@gmail.com>.
bq. I think it will be better to return something like keycrc%numPartitions

Can you explain how keycrc is obtained?
I think if we change this logic, we should make it serve a (relatively) more
general use case.
But I didn't find, in Hadoop 1.0, how a Partitioner can accept parameters:

$ find . -name '*.java' -exec grep "mapred.partitioner." {} \; -print
    return getClass("mapred.partitioner.class",
    setClass("mapred.partitioner.class", theClass, Partitioner.class);
./src/mapred/org/apache/hadoop/mapred/JobConf.java
        ensureNotSet("mapred.partitioner.class", mode);
./src/mapred/org/apache/hadoop/mapreduce/Job.java
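
One plausible reading of keycrc, purely as an illustration (later in the
thread Jean-Marc clarifies he really meant the key's hash rather than a CRC),
would be a CRC32 over the serialized row key:

    import java.util.zip.CRC32;

    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

    // Illustration only: one way a "keycrc % numPartitions" value could be
    // computed for an HBase row key; the thread never pins down the formula.
    public final class KeyCrc {
      private KeyCrc() {}

      public static int partitionFor(ImmutableBytesWritable key, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(key.get(), key.getOffset(), key.getLength());
        // getValue() returns the unsigned 32-bit checksum in a long, so the
        // modulo result is already non-negative.
        return (int) (crc.getValue() % numPartitions);
      }
    }

As for parameters: a partitioner can typically implement Configurable (or, in
the old mapred API, use JobConfigurable's configure(JobConf)) and read whatever
it needs from the job configuration when the framework instantiates it; there
is no dedicated mapred.partitioner.* setting beyond the class name.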

Cheers

On Wed, Apr 10, 2013 at 12:01 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Greame,
>
> No. The reducer will simply write on the table the same way you are doing a
> regular Put. If a split is required because of the size, then the region
> will be split, but at the end, there will not necessary be any region
> split.
>
> In the usecase described below, all the 600 lines will "simply" go into the
> only region in the table and no split will occur.
>
> The goal is to partition the data for the reducer only. Not in the table.
>
> JM
>
> 2013/4/10 Graeme Wallace <gr...@farecompare.com>
>
> > Whats the behavior then if you return hash % num_reducers and you have no
> > splits defined. When the reducer writes to the table does the region
> server
> > local to the reducer create a new region ?
> >
> > Graeme
> >
> >
> > On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > So.
> > >
> > > I looked at the code, and I have one comment/suggestion here.
> > >
> > > If the table we are outputing to has regions, then partitions are build
> > > around that, and that's fine. But if the table is totally empty with a
> > > single region, even if we setNumReduceTasks to 2 or more, all the keys
> > will
> > > go on the same first reducer because of this:
> > >     if (this.startKeys.length == 1){
> > >       return 0;
> > >     }
> > > I think it will be better to return something like keycrc%numPartitions
> > > instead. That still allow the application to spread the reducing
> process
> > > over multinode(racks) even if there is only one region in the table.
> > >
> > > In my usecase, I have millions of lines producing some statistics. At
> the
> > > end, I will have only about 600 lines, but it will take a lot of map
> and
> > > reduce time to go from millions to 600, that's why I'm looking to have
> > more
> > > than one reducer. However, with only 600 lines, it's very difficult to
> > > pre-split the table. Keys are all very close.
> > >
> > > Does anyone see anything wrong with changing this default behaviour
> when
> > > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> > >
> > > JM
> > >
> > > 2013/4/10 Jean-Marc Spaggiari <je...@spaggiari.org>
> > >
> > > > Thanks Ted.
> > > >
> > > > It's exactly where I was looking at now. I was close. I will take a
> > > deeper
> > > > look.
> > > >
> > > > Thanks Nitin for the link. I will read that too.
> > > >
> > > > JM
> > > >
> > > > 2013/4/10 Nitin Pawar <ni...@gmail.com>
> > > >
> > > >> To add what Ted said,
> > > >>
> > > >> the same discussion happened on the question Jean asked
> > > >>
> > > >> https://issues.apache.org/jira/browse/HBASE-1287
> > > >>
> > > >>
> > > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > >>
> > > >> > Jean-Marc:
> > > >> > Take a look at HRegionPartitioner which is in both mapred and
> > > mapreduce
> > > >> > packages:
> > > >> >
> > > >> >  * This is used to partition the output keys into groups of keys.
> > > >> >
> > > >> >  * Keys are grouped according to the regions that currently exist
> > > >> >
> > > >> >  * so that each reducer fills a single region so load is
> > distributed.
> > > >> >
> > > >> > Cheers
> > > >> >
> > > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > > >> > jean-marc@spaggiari.org> wrote:
> > > >> >
> > > >> > > Hi Nitin,
> > > >> > >
> > > >> > > You got my question correctly.
> > > >> > >
> > > >> > > However, I'm wondering how it's working when it's done into
> HBase.
> > > Do
> > > >> > > we have defaults partionners so we have the same garantee that
> > > records
> > > >> > > mapping to one key go to the same reducer. Or do we have to
> > > implement
> > > >> > > this one our own.
> > > >> > >
> > > >> > > JM
> > > >> > >
> > > >> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> > > >> > > > I hope i understood what you are asking is this . If not then
> > > >> pardon me
> > > >> > > :)
> > > >> > > > from the hadoop developer handbook few lines
> > > >> > > >
> > > >> > > > The*Partitioner* class determines which partition a given
> (key,
> > > >> value)
> > > >> > > pair
> > > >> > > > will go to. The default partitioner computes a hash value for
> > the
> > > >> key
> > > >> > and
> > > >> > > > assigns the partition based on this result. It garantees that
> > all
> > > >> the
> > > >> > > > records mapping to one key go to same reducer
> > > >> > > >
> > > >> > > > You can write your custom partitioner as well
> > > >> > > > here is the link :
> > > >> > > >
> > > >>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > > >> > > > jean-marc@spaggiari.org> wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> quick question. How are the data from the map tasks
> > partitionned
> > > >> for
> > > >> > > >> the reducers?
> > > >> > > >>
> > > >> > > >> If there is 1 reducer, it's easy, but if there is more, are
> all
> > > >> they
> > > >> > > >> same keys garanteed to end on the same reducer? Or not
> > necessary?
> > > >>  If
> > > >> > > >> they are not, how can we provide a partionning function?
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >>
> > > >> > > >> JM
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Nitin Pawar
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Nitin Pawar
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Graeme Wallace
> > CTO
> > FareCompare.com
> > O: 972 588 1414
> > M: 214 681 9018
> >
>

Re: MapReduce: Reducers partitions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Thanks all for your comments.

I looked for partitioners in the HBase scope only, which is why I also thought
we were using HTablePartitioner. But looking at the default one used, I
found org.apache.hadoop.mapreduce.lib.partition.HashPartitioner, as St.Ack
confirmed. And it's doing exactly what I was talking about with the key hash
(and not keycrc).

Changing the HRegionPartitioner behaviour would also be useless, because
TableMapReduceUtil will overwrite the number of reducers if we have set
more than the number of regions:

      if (job.getNumReduceTasks() > regions) {
        job.setNumReduceTasks(outputTable.getRegionsInfo().size());
      }

So I just need to stay with the default partitioner, then.
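
A hedged sketch of the resulting job wiring: passing null as the partitioner
to TableMapReduceUtil keeps Hadoop's default HashPartitioner, so the reducer
count set below is honoured even for a one-region table. The table name
"stats" and StatsReducer are placeholders (StatsReducer stands for an ordinary
TableReducer, like the one sketched further down this page).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch only: "stats" and StatsReducer are hypothetical names.
    public class StatsJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "stats aggregation");
        job.setJarByClass(StatsJob.class);

        // Mapper setup elided; assume it emits one (key, count) pair per row.

        // A null partitioner leaves the default HashPartitioner in place.
        TableMapReduceUtil.initTableReducerJob("stats", StatsReducer.class, job, null);
        job.setNumReduceTasks(4);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }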

Thanks,

JM



2013/4/10 Stack <st...@duboce.net>

> On Wed, Apr 10, 2013 at 12:01 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Greame,
> >
> > No. The reducer will simply write on the table the same way you are
> doing a
> > regular Put. If a split is required because of the size, then the region
> > will be split, but at the end, there will not necessary be any region
> > split.
> >
> > In the usecase described below, all the 600 lines will "simply" go into
> the
> > only region in the table and no split will occur.
> >
> > The goal is to partition the data for the reducer only. Not in the table.
> >
>
>
> Then just use the default partitioner?
>
> The suggestion that you use HTablePartitioner seems inappropriate to your
> task.  See the sink doc here:
>
> http://hadoop.apache.org/docs/r2.0.3-alpha/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html
>
> St.Ack
>

Re: MapReduce: Reducers partitions.

Posted by Stack <st...@duboce.net>.
On Wed, Apr 10, 2013 at 12:01 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Greame,
>
> No. The reducer will simply write on the table the same way you are doing a
> regular Put. If a split is required because of the size, then the region
> will be split, but at the end, there will not necessary be any region
> split.
>
> In the usecase described below, all the 600 lines will "simply" go into the
> only region in the table and no split will occur.
>
> The goal is to partition the data for the reducer only. Not in the table.
>


Then just use the default partitioner?

The suggestion that you use HTablePartitioner seems inappropriate to your
task. See the linked doc here:
http://hadoop.apache.org/docs/r2.0.3-alpha/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html

St.Ack

Re: MapReduce: Reducers partitions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Graeme,

No. The reducer will simply write to the table the same way you would do a
regular Put. If a split is required because of the size, then the region
will be split, but in the end there will not necessarily be any region
split.

In the use case described below, all 600 lines will "simply" go into the
only region in the table and no split will occur.

The goal is to partition the data for the reducers only, not in the table.
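
Concretely, the reduce side is just a TableReducer emitting Puts, the same Put
you would build in a client. A minimal sketch (the column family "s", the
qualifier "count" and the key/value types are assumptions for the example):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Sketch: sums the counts for one key and writes a single Put, exactly
    // as a regular client-side Put would. Family "s" is a placeholder.
    public class StatsReducer
        extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable v : values) {
          total += v.get();
        }
        byte[] row = Bytes.toBytes(key.toString());
        Put put = new Put(row);
        put.add(Bytes.toBytes("s"), Bytes.toBytes("count"), Bytes.toBytes(total));
        context.write(new ImmutableBytesWritable(row), put);
      }
    }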

JM

2013/4/10 Graeme Wallace <gr...@farecompare.com>

> Whats the behavior then if you return hash % num_reducers and you have no
> splits defined. When the reducer writes to the table does the region server
> local to the reducer create a new region ?
>
> Graeme
>
>
> On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > So.
> >
> > I looked at the code, and I have one comment/suggestion here.
> >
> > If the table we are outputing to has regions, then partitions are build
> > around that, and that's fine. But if the table is totally empty with a
> > single region, even if we setNumReduceTasks to 2 or more, all the keys
> will
> > go on the same first reducer because of this:
> >     if (this.startKeys.length == 1){
> >       return 0;
> >     }
> > I think it will be better to return something like keycrc%numPartitions
> > instead. That still allow the application to spread the reducing process
> > over multinode(racks) even if there is only one region in the table.
> >
> > In my usecase, I have millions of lines producing some statistics. At the
> > end, I will have only about 600 lines, but it will take a lot of map and
> > reduce time to go from millions to 600, that's why I'm looking to have
> more
> > than one reducer. However, with only 600 lines, it's very difficult to
> > pre-split the table. Keys are all very close.
> >
> > Does anyone see anything wrong with changing this default behaviour when
> > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> >
> > JM
> >
> > 2013/4/10 Jean-Marc Spaggiari <je...@spaggiari.org>
> >
> > > Thanks Ted.
> > >
> > > It's exactly where I was looking at now. I was close. I will take a
> > deeper
> > > look.
> > >
> > > Thanks Nitin for the link. I will read that too.
> > >
> > > JM
> > >
> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>
> > >
> > >> To add what Ted said,
> > >>
> > >> the same discussion happened on the question Jean asked
> > >>
> > >> https://issues.apache.org/jira/browse/HBASE-1287
> > >>
> > >>
> > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com> wrote:
> > >>
> > >> > Jean-Marc:
> > >> > Take a look at HRegionPartitioner which is in both mapred and
> > mapreduce
> > >> > packages:
> > >> >
> > >> >  * This is used to partition the output keys into groups of keys.
> > >> >
> > >> >  * Keys are grouped according to the regions that currently exist
> > >> >
> > >> >  * so that each reducer fills a single region so load is
> distributed.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > >> > jean-marc@spaggiari.org> wrote:
> > >> >
> > >> > > Hi Nitin,
> > >> > >
> > >> > > You got my question correctly.
> > >> > >
> > >> > > However, I'm wondering how it's working when it's done into HBase.
> > Do
> > >> > > we have defaults partionners so we have the same garantee that
> > records
> > >> > > mapping to one key go to the same reducer. Or do we have to
> > implement
> > >> > > this one our own.
> > >> > >
> > >> > > JM
> > >> > >
> > >> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> > >> > > > I hope i understood what you are asking is this . If not then
> > >> pardon me
> > >> > > :)
> > >> > > > from the hadoop developer handbook few lines
> > >> > > >
> > >> > > > The*Partitioner* class determines which partition a given (key,
> > >> value)
> > >> > > pair
> > >> > > > will go to. The default partitioner computes a hash value for
> the
> > >> key
> > >> > and
> > >> > > > assigns the partition based on this result. It garantees that
> all
> > >> the
> > >> > > > records mapping to one key go to same reducer
> > >> > > >
> > >> > > > You can write your custom partitioner as well
> > >> > > > here is the link :
> > >> > > >
> > >> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > >> > > > jean-marc@spaggiari.org> wrote:
> > >> > > >
> > >> > > >> Hi,
> > >> > > >>
> > >> > > >> quick question. How are the data from the map tasks
> partitionned
> > >> for
> > >> > > >> the reducers?
> > >> > > >>
> > >> > > >> If there is 1 reducer, it's easy, but if there is more, are all
> > >> they
> > >> > > >> same keys garanteed to end on the same reducer? Or not
> necessary?
> > >>  If
> > >> > > >> they are not, how can we provide a partionning function?
> > >> > > >>
> > >> > > >> Thanks,
> > >> > > >>
> > >> > > >> JM
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Nitin Pawar
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Nitin Pawar
> > >>
> > >
> > >
> >
>
>
>
> --
> Graeme Wallace
> CTO
> FareCompare.com
> O: 972 588 1414
> M: 214 681 9018
>

Re: MapReduce: Reducers partitions.

Posted by Graeme Wallace <gr...@farecompare.com>.
Whats the behavior then if you return hash % num_reducers and you have no
splits defined. When the reducer writes to the table does the region server
local to the reducer create a new region ?

Graeme


On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> So.
>
> I looked at the code, and I have one comment/suggestion here.
>
> If the table we are outputing to has regions, then partitions are build
> around that, and that's fine. But if the table is totally empty with a
> single region, even if we setNumReduceTasks to 2 or more, all the keys will
> go on the same first reducer because of this:
>     if (this.startKeys.length == 1){
>       return 0;
>     }
> I think it will be better to return something like keycrc%numPartitions
> instead. That still allow the application to spread the reducing process
> over multinode(racks) even if there is only one region in the table.
>
> In my usecase, I have millions of lines producing some statistics. At the
> end, I will have only about 600 lines, but it will take a lot of map and
> reduce time to go from millions to 600, that's why I'm looking to have more
> than one reducer. However, with only 600 lines, it's very difficult to
> pre-split the table. Keys are all very close.
>
> Does anyone see anything wrong with changing this default behaviour when
> startKeys.length == 1? If not, I will open a JIRA and upload a patch.
>
> JM
>
> 2013/4/10 Jean-Marc Spaggiari <je...@spaggiari.org>
>
> > Thanks Ted.
> >
> > It's exactly where I was looking at now. I was close. I will take a
> deeper
> > look.
> >
> > Thanks Nitin for the link. I will read that too.
> >
> > JM
> >
> > 2013/4/10 Nitin Pawar <ni...@gmail.com>
> >
> >> To add what Ted said,
> >>
> >> the same discussion happened on the question Jean asked
> >>
> >> https://issues.apache.org/jira/browse/HBASE-1287
> >>
> >>
> >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >> > Jean-Marc:
> >> > Take a look at HRegionPartitioner which is in both mapred and
> mapreduce
> >> > packages:
> >> >
> >> >  * This is used to partition the output keys into groups of keys.
> >> >
> >> >  * Keys are grouped according to the regions that currently exist
> >> >
> >> >  * so that each reducer fills a single region so load is distributed.
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> >> > jean-marc@spaggiari.org> wrote:
> >> >
> >> > > Hi Nitin,
> >> > >
> >> > > You got my question correctly.
> >> > >
> >> > > However, I'm wondering how it's working when it's done into HBase.
> Do
> >> > > we have defaults partionners so we have the same garantee that
> records
> >> > > mapping to one key go to the same reducer. Or do we have to
> implement
> >> > > this one our own.
> >> > >
> >> > > JM
> >> > >
> >> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> >> > > > I hope i understood what you are asking is this . If not then
> >> pardon me
> >> > > :)
> >> > > > from the hadoop developer handbook few lines
> >> > > >
> >> > > > The*Partitioner* class determines which partition a given (key,
> >> value)
> >> > > pair
> >> > > > will go to. The default partitioner computes a hash value for the
> >> key
> >> > and
> >> > > > assigns the partition based on this result. It garantees that all
> >> the
> >> > > > records mapping to one key go to same reducer
> >> > > >
> >> > > > You can write your custom partitioner as well
> >> > > > here is the link :
> >> > > >
> >> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> >> > > > jean-marc@spaggiari.org> wrote:
> >> > > >
> >> > > >> Hi,
> >> > > >>
> >> > > >> quick question. How are the data from the map tasks partitionned
> >> for
> >> > > >> the reducers?
> >> > > >>
> >> > > >> If there is 1 reducer, it's easy, but if there is more, are all
> >> they
> >> > > >> same keys garanteed to end on the same reducer? Or not necessary?
> >>  If
> >> > > >> they are not, how can we provide a partionning function?
> >> > > >>
> >> > > >> Thanks,
> >> > > >>
> >> > > >> JM
> >> > > >>
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Nitin Pawar
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Nitin Pawar
> >>
> >
> >
>



-- 
Graeme Wallace
CTO
FareCompare.com
O: 972 588 1414
M: 214 681 9018

Re: MapReduce: Reducers partitions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
So.

I looked at the code, and I have one comment/suggestion here.

If the table we are outputting to has regions, then partitions are built
around them, and that's fine. But if the table is totally empty with a
single region, even if we setNumReduceTasks to 2 or more, all the keys will
go to the same first reducer because of this:
    if (this.startKeys.length == 1){
      return 0;
    }
I think it would be better to return something like keycrc%numPartitions
instead. That would still allow the application to spread the reducing process
over multiple nodes (racks) even if there is only one region in the table.
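
As a sketch, the change being proposed here could look like the following
inside getPartition() (this is the suggestion under discussion, not the
shipped HRegionPartitioner, and it uses the key's hash rather than a CRC):

    if (this.startKeys.length == 1){
      // Proposed: hash the row key so a single-region table can still feed
      // several reducers, instead of always returning partition 0.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }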

In my use case, I have millions of lines producing some statistics. At the
end, I will have only about 600 lines, but it will take a lot of map and
reduce time to go from millions to 600; that's why I'm looking to have more
than one reducer. However, with only 600 lines, it's very difficult to
pre-split the table. The keys are all very close.

Does anyone see anything wrong with changing this default behaviour when
startKeys.length == 1? If not, I will open a JIRA and upload a patch.

JM

2013/4/10 Jean-Marc Spaggiari <je...@spaggiari.org>

> Thanks Ted.
>
> It's exactly where I was looking at now. I was close. I will take a deeper
> look.
>
> Thanks Nitin for the link. I will read that too.
>
> JM
>
> 2013/4/10 Nitin Pawar <ni...@gmail.com>
>
>> To add what Ted said,
>>
>> the same discussion happened on the question Jean asked
>>
>> https://issues.apache.org/jira/browse/HBASE-1287
>>
>>
>> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > Jean-Marc:
>> > Take a look at HRegionPartitioner which is in both mapred and mapreduce
>> > packages:
>> >
>> >  * This is used to partition the output keys into groups of keys.
>> >
>> >  * Keys are grouped according to the regions that currently exist
>> >
>> >  * so that each reducer fills a single region so load is distributed.
>> >
>> > Cheers
>> >
>> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> > > Hi Nitin,
>> > >
>> > > You got my question correctly.
>> > >
>> > > However, I'm wondering how it's working when it's done into HBase. Do
>> > > we have defaults partionners so we have the same garantee that records
>> > > mapping to one key go to the same reducer. Or do we have to implement
>> > > this one our own.
>> > >
>> > > JM
>> > >
>> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
>> > > > I hope i understood what you are asking is this . If not then
>> pardon me
>> > > :)
>> > > > from the hadoop developer handbook few lines
>> > > >
>> > > > The*Partitioner* class determines which partition a given (key,
>> value)
>> > > pair
>> > > > will go to. The default partitioner computes a hash value for the
>> key
>> > and
>> > > > assigns the partition based on this result. It garantees that all
>> the
>> > > > records mapping to one key go to same reducer
>> > > >
>> > > > You can write your custom partitioner as well
>> > > > here is the link :
>> > > >
>> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
>> > > > jean-marc@spaggiari.org> wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> quick question. How are the data from the map tasks partitionned
>> for
>> > > >> the reducers?
>> > > >>
>> > > >> If there is 1 reducer, it's easy, but if there is more, are all
>> they
>> > > >> same keys garanteed to end on the same reducer? Or not necessary?
>>  If
>> > > >> they are not, how can we provide a partionning function?
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> JM
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Nitin Pawar
>> > >
>> >
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>

Re: MapReduce: Reducers partitions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Thanks Ted.

It's exactly where I was looking now. I was close. I will take a deeper
look.

Thanks Nitin for the link. I will read that too.

JM

2013/4/10 Nitin Pawar <ni...@gmail.com>

> To add what Ted said,
>
> the same discussion happened on the question Jean asked
>
> https://issues.apache.org/jira/browse/HBASE-1287
>
>
> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Jean-Marc:
> > Take a look at HRegionPartitioner which is in both mapred and mapreduce
> > packages:
> >
> >  * This is used to partition the output keys into groups of keys.
> >
> >  * Keys are grouped according to the regions that currently exist
> >
> >  * so that each reducer fills a single region so load is distributed.
> >
> > Cheers
> >
> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi Nitin,
> > >
> > > You got my question correctly.
> > >
> > > However, I'm wondering how it's working when it's done into HBase. Do
> > > we have defaults partionners so we have the same garantee that records
> > > mapping to one key go to the same reducer. Or do we have to implement
> > > this one our own.
> > >
> > > JM
> > >
> > > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> > > > I hope i understood what you are asking is this . If not then pardon
> me
> > > :)
> > > > from the hadoop developer handbook few lines
> > > >
> > > > The*Partitioner* class determines which partition a given (key,
> value)
> > > pair
> > > > will go to. The default partitioner computes a hash value for the key
> > and
> > > > assigns the partition based on this result. It garantees that all the
> > > > records mapping to one key go to same reducer
> > > >
> > > > You can write your custom partitioner as well
> > > > here is the link :
> > > > http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > > > jean-marc@spaggiari.org> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> quick question. How are the data from the map tasks partitionned for
> > > >> the reducers?
> > > >>
> > > >> If there is 1 reducer, it's easy, but if there is more, are all they
> > > >> same keys garanteed to end on the same reducer? Or not necessary?
>  If
> > > >> they are not, how can we provide a partionning function?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> JM
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Nitin Pawar
> > >
> >
>
>
>
> --
> Nitin Pawar
>

Re: MapReduce: Reducers partitions.

Posted by Nitin Pawar <ni...@gmail.com>.
To add to what Ted said,

the same discussion happened on the question Jean asked:

https://issues.apache.org/jira/browse/HBASE-1287


On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yu...@gmail.com> wrote:

> Jean-Marc:
> Take a look at HRegionPartitioner which is in both mapred and mapreduce
> packages:
>
>  * This is used to partition the output keys into groups of keys.
>
>  * Keys are grouped according to the regions that currently exist
>
>  * so that each reducer fills a single region so load is distributed.
>
> Cheers
>
> On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Nitin,
> >
> > You got my question correctly.
> >
> > However, I'm wondering how it's working when it's done into HBase. Do
> > we have defaults partionners so we have the same garantee that records
> > mapping to one key go to the same reducer. Or do we have to implement
> > this one our own.
> >
> > JM
> >
> > 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> > > I hope i understood what you are asking is this . If not then pardon me
> > :)
> > > from the hadoop developer handbook few lines
> > >
> > > The*Partitioner* class determines which partition a given (key, value)
> > pair
> > > will go to. The default partitioner computes a hash value for the key
> and
> > > assigns the partition based on this result. It garantees that all the
> > > records mapping to one key go to same reducer
> > >
> > > You can write your custom partitioner as well
> > > here is the link :
> > > http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > >
> > >
> > >
> > >
> > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > >> Hi,
> > >>
> > >> quick question. How are the data from the map tasks partitionned for
> > >> the reducers?
> > >>
> > >> If there is 1 reducer, it's easy, but if there is more, are all they
> > >> same keys garanteed to end on the same reducer? Or not necessary?  If
> > >> they are not, how can we provide a partionning function?
> > >>
> > >> Thanks,
> > >>
> > >> JM
> > >>
> > >
> > >
> > >
> > > --
> > > Nitin Pawar
> >
>



-- 
Nitin Pawar

Re: MapReduce: Reducers partitions.

Posted by Ted Yu <yu...@gmail.com>.
Jean-Marc:
Take a look at HRegionPartitioner, which is in both the mapred and mapreduce
packages:

 * This is used to partition the output keys into groups of keys.

 * Keys are grouped according to the regions that currently exist

 * so that each reducer fills a single region so load is distributed.
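
For reference, wiring it in is a single call through TableMapReduceUtil (a
hedged sketch; "mytable" and MyTableReducer are placeholders, and both classes
live in org.apache.hadoop.hbase.mapreduce):

    // Sketch: partition reducer output by the table's existing regions.
    TableMapReduceUtil.initTableReducerJob(
        "mytable",                  // output table
        MyTableReducer.class,       // a TableReducer writing Puts/Deletes
        job,                        // the Job being configured
        HRegionPartitioner.class);  // one reducer per existing region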

Cheers

On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Nitin,
>
> You got my question correctly.
>
> However, I'm wondering how it's working when it's done into HBase. Do
> we have defaults partionners so we have the same garantee that records
> mapping to one key go to the same reducer. Or do we have to implement
> this one our own.
>
> JM
>
> 2013/4/10 Nitin Pawar <ni...@gmail.com>:
> > I hope i understood what you are asking is this . If not then pardon me
> :)
> > from the hadoop developer handbook few lines
> >
> > The*Partitioner* class determines which partition a given (key, value)
> pair
> > will go to. The default partitioner computes a hash value for the key and
> > assigns the partition based on this result. It garantees that all the
> > records mapping to one key go to same reducer
> >
> > You can write your custom partitioner as well
> > here is the link :
> > http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> >
> >
> >
> >
> > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> Hi,
> >>
> >> quick question. How are the data from the map tasks partitionned for
> >> the reducers?
> >>
> >> If there is 1 reducer, it's easy, but if there is more, are all they
> >> same keys garanteed to end on the same reducer? Or not necessary?  If
> >> they are not, how can we provide a partionning function?
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >
> >
> >
> > --
> > Nitin Pawar
>

Re: MapReduce: Reducers partitions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Nitin,

You got my question correctly.

However, I'm wondering how it works when it's done with HBase. Do
we have default partitioners, so that we have the same guarantee that records
mapping to one key go to the same reducer? Or do we have to implement
this on our own?

JM

2013/4/10 Nitin Pawar <ni...@gmail.com>:
> I hope i understood what you are asking is this . If not then pardon me :)
> from the hadoop developer handbook few lines
>
> The*Partitioner* class determines which partition a given (key, value) pair
> will go to. The default partitioner computes a hash value for the key and
> assigns the partition based on this result. It garantees that all the
> records mapping to one key go to same reducer
>
> You can write your custom partitioner as well
> here is the link :
> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
>
>
>
>
> On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi,
>>
>> quick question. How are the data from the map tasks partitionned for
>> the reducers?
>>
>> If there is 1 reducer, it's easy, but if there is more, are all they
>> same keys garanteed to end on the same reducer? Or not necessary?  If
>> they are not, how can we provide a partionning function?
>>
>> Thanks,
>>
>> JM
>>
>
>
>
> --
> Nitin Pawar

Re: MapReduce: Reducers partitions.

Posted by Nitin Pawar <ni...@gmail.com>.
I hope I understood what you are asking; if not, pardon me :)
A few lines from the Hadoop developer handbook:

The *Partitioner* class determines which partition a given (key, value) pair
will go to. The default partitioner computes a hash value for the key and
assigns the partition based on this result. It guarantees that all the
records mapping to one key go to the same reducer.

You can write your custom partitioner as well;
here is the link:
http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
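
A hedged sketch of such a custom partitioner, assuming Text keys (the routing
rule, the first byte of the key, is only an example):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Example only: route keys by their first byte so keys sharing a prefix
    // end up on the same reducer. Text/LongWritable are assumed types.
    public class FirstBytePartitioner extends Partitioner<Text, LongWritable> {
      @Override
      public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        if (key.getLength() == 0) {
          return 0;
        }
        return (key.getBytes()[0] & 0xff) % numReduceTasks;
      }
    }

It would be enabled with job.setPartitionerClass(FirstBytePartitioner.class).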




On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi,
>
> quick question. How are the data from the map tasks partitionned for
> the reducers?
>
> If there is 1 reducer, it's easy, but if there is more, are all they
> same keys garanteed to end on the same reducer? Or not necessary?  If
> they are not, how can we provide a partionning function?
>
> Thanks,
>
> JM
>



-- 
Nitin Pawar