Posted to user@hbase.apache.org by Panayotis Antonopoulos <an...@hotmail.com> on 2011/05/25 12:58:50 UTC

HFiles that fit within a single region VS better load balancing at reduce phase

Hello,
I am currently working on an MR job that outputs HFiles to be bulk loaded into an HBase table.
According to the HBase site, for bulk loading to be efficient, each HFile produced by the MR job should fit within a single region.
To achieve that, I use the TotalOrderPartitioner so that each reducer gets key/value pairs from a single region.
However, this prevents me from partitioning the mappers' output into equal splits, which would give the best possible load balancing during the reduce phase.
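
To make the trade-off concrete, here is a minimal Python sketch (not HBase or Hadoop code) that models both choices. The function names, the sample row keys, and the region boundaries are all hypothetical; the only assumption carried over from the setup above is that a total-order partitioner routes each key to a partition by comparing it against a sorted list of split points.

```python
from bisect import bisect_right
from collections import Counter


def partition_index(split_points, row_key):
    """Partition i holds keys in [split_points[i-1], split_points[i])."""
    return bisect_right(split_points, row_key)


def partition_counts(split_points, row_keys):
    """Number of keys each reducer would receive for the given split points."""
    counts = Counter(partition_index(split_points, k) for k in row_keys)
    return [counts.get(i, 0) for i in range(len(split_points) + 1)]


def balanced_split_points(row_keys, num_partitions):
    """Pick split points from the sorted keys so partitions are roughly equal."""
    ordered = sorted(row_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


# Hypothetical data: most keys fall into the first region.
keys = [b"a1", b"a2", b"b1", b"c1", b"d1", b"e1", b"h1", b"p1", b"x1"]
region_boundaries = [b"g", b"n", b"t"]  # assumed region start keys

region_aligned = partition_counts(region_boundaries, keys)
balanced = partition_counts(balanced_split_points(keys, 4), keys)
print(region_aligned, balanced)
```

With region-aligned split points, every reducer's output stays inside one region, but one reducer does most of the work. With balanced split points, the reducers are even, but a reducer's output can straddle a region boundary.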

So I would like to ask how important it is to create HFiles that fit within a single region.
If it makes bulk loading much faster, it is probably better to sacrifice load balancing.
But is this the case?
Has anyone tried both choices?

Thank you in advance!
Panagiotis.

Re: HFiles that fit within a single region VS better load balancing at reduce phase

Posted by Ted Yu <yu...@gmail.com>.
HBASE-3721 was integrated into trunk, not 0.90.x.
HBASE-3871 is under review.

So I would interpret my answer as tilting toward outputting HFiles that fit
within a single region.

If, after your effort, there are still some HFiles that don't fit, you can
try my patches.

Thanks


RE: HFiles that fit within a single region VS better load balancing at reduce phase

Posted by Panayotis Antonopoulos <an...@hotmail.com>.
So your answer would be that it is better to have the best possible load balancing during the reduce phase than to take care to output HFiles that fit within a single region, because the splitting done by Incremental Load is rather fast?


Re: HFiles that fit within a single region VS better load balancing at reduce phase

Posted by Ted Yu <yu...@gmail.com>.
LoadIncrementalHFiles splits an HFile if it doesn't fit within a single
region.
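
As a rough illustration of what "fits within a single region" means, the Python sketch below checks which regions an HFile's key range covers and where it would have to be cut. It is a simplified model, not the actual LoadIncrementalHFiles logic; the function names and region start keys are hypothetical, and it assumes the table's region start keys are sorted with the first region starting at the empty key.

```python
from bisect import bisect_right


def regions_spanned(region_start_keys, first_key, last_key):
    """Indexes of the regions covered by an HFile holding [first_key, last_key]."""
    lo = bisect_right(region_start_keys, first_key) - 1
    hi = bisect_right(region_start_keys, last_key) - 1
    return list(range(lo, hi + 1))


def fits_in_single_region(region_start_keys, first_key, last_key):
    return len(regions_spanned(region_start_keys, first_key, last_key)) == 1


def split_keys(region_start_keys, first_key, last_key):
    """Region start keys at which the HFile would need to be cut."""
    spanned = regions_spanned(region_start_keys, first_key, last_key)
    return [region_start_keys[i] for i in spanned[1:]]


# Hypothetical table with four regions.
starts = [b"", b"g", b"n", b"t"]

print(fits_in_single_region(starts, b"h1", b"m9"))  # range inside region [g, n)
print(split_keys(starts, b"e1", b"p1"))             # range crossing two boundaries
```

An HFile whose range stays inside one region needs no cuts; one that crosses boundaries needs one cut per boundary crossed, which is the extra work done at load time.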

Please refer to the following JIRAs, which speed up LoadIncrementalHFiles:
https://issues.apache.org/jira/browse/HBASE-3871
https://issues.apache.org/jira/browse/HBASE-3721

Note: the parallelized splitting of HFiles by LoadIncrementalHFiles still
runs on a single machine.

Thanks
