Posted to user@hbase.apache.org by Geoff Hendrey <gh...@decarta.com> on 2011/02/10 08:36:56 UTC

getSplits question

Are endrows inclusive or exclusive? The docs say exclusive, but then the
question arises of how to form the last split for getSplits(). The code
below runs fine, but I believe it is omitting some rows, perhaps because
of the exclusive end row. For the final split, should the endrow be
null? I tried that, and got what appeared to be a final split without an
endrow at all. I would appreciate a pointer to a correct implementation
of getSplits that lets me provide a startrow, endrow, and splitsize.
Apparently this isn't it :-) :

 

int splitSize = context.getConfiguration().getInt("splitsize", 1000);
byte[] splitStop = null;
String hostname = null;
while ((results = resultScanner.next(splitSize)).length > 0) {
    // System.out.println("results :-------------------------- " + results);
    byte[] splitStart = results[0].getRow();
    // I think this is a problem... we don't actually include this row in
    // the split since it's exclusive.. revisit this and correct
    splitStop = results[results.length - 1].getRow();
    HRegionLocation location = table.getRegionLocation(splitStart);
    hostname = location.getServerAddress().getHostname();
    InputSplit split = new TableSplit(table.getTableName(), splitStart, splitStop, hostname);
    splits.add(split);
    System.out.println("initializing splits: " + split.toString());
}
resultScanner.close();
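
A minimal sketch of one possible fix, assuming the goal is to keep the
last returned row inside each split: since the end row is exclusive, use
the smallest row key that sorts after the last row (the last row key with
a zero byte appended) as the split stop. Bytes.add is the stock HBase
utility; the other variables come from the loop above.

    import org.apache.hadoop.hbase.util.Bytes;

    // Make the exclusive end row cover the last returned row by using its
    // immediate successor (lastRow + 0x00) as the split stop key.
    byte[] lastRow = results[results.length - 1].getRow();
    splitStop = Bytes.add(lastRow, new byte[] { 0 });

An empty end row, by contrast, is I believe treated as open-ended (scan
to the end of the table), which would match the final-split behavior
observed above.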

 

 

-g


Re: question about org.apache.hadoop.hbase.util.Merge

Posted by Sebastian Bauer <ad...@ugame.net.pl>.
Yep, there were a few problems, but now it's working with trunk and is
supposed to work with 0.90. The biggest problem was that OnlineMerge
doesn't assign the new region, which caused the test to fail with
"NoServerForRegionException: No server address listed in .META. for
region ..."

On 10.02.2011 20:56, Jean-Daniel Cryans wrote:
> For the record, I posted a patch in this jira that does online merge
> on 0.89 but it's awfully buggy
> https://issues.apache.org/jira/browse/HBASE-1621
>
> It's probably all different in 0.90 now tho.
>
> J-D
>
> On Thu, Feb 10, 2011 at 11:51 AM, Ryan Rawson<ry...@gmail.com>  wrote:
>> Since the Merge tool works on an offline cluster, it goes straight to
>> the META HFiles, thus cannot be run in parallel.
>>
>> It shouldn't be too hard to hack up Merge to work on an online
>> cluster, offline table.
>>
>>
>>
>> On Thu, Feb 10, 2011 at 10:09 AM, Jean-Daniel Cryans
>> <jd...@apache.org>  wrote:
>>> I think not, it opens and edits .META. so it would be like having
>>> multiple region servers serving it (which is always bad).
>>>
>>> J-D
>>>
>>> On Thu, Feb 10, 2011 at 5:22 AM, Sebastian Bauer<ad...@ugame.net.pl>  wrote:
>>>> Hi, does anybody know whether "./bin/hbase org.apache.hadoop.hbase.util.Merge"
>>>> can run in parallel?
>>>>
>>>> Thanks,
>>>> Sebastian Bauer
>>>>

Re: question about org.apache.hadoop.hbase.util.Merge

Posted by Jean-Daniel Cryans <jd...@apache.org>.
For the record, I posted a patch in this jira that does online merge
on 0.89 but it's awfully buggy
https://issues.apache.org/jira/browse/HBASE-1621

It's probably all different in 0.90 now tho.

J-D

On Thu, Feb 10, 2011 at 11:51 AM, Ryan Rawson <ry...@gmail.com> wrote:
> Since the Merge tool works on an offline cluster, it goes straight to
> the META HFiles, thus cannot be run in parallel.
>
> It shouldn't be too hard to hack up Merge to work on an online
> cluster, offline table.
>
>
>
> On Thu, Feb 10, 2011 at 10:09 AM, Jean-Daniel Cryans
> <jd...@apache.org> wrote:
>> I think not, it opens and edits .META. so it would be like having
>> multiple region servers serving it (which is always bad).
>>
>> J-D
>>
>> On Thu, Feb 10, 2011 at 5:22 AM, Sebastian Bauer <ad...@ugame.net.pl> wrote:
>>> Hi, does anybody know whether "./bin/hbase org.apache.hadoop.hbase.util.Merge"
>>> can run in parallel?
>>>
>>> Thanks,
>>> Sebastian Bauer
>>>
>>
>

Re: question about org.apache.hadoop.hbase.util.Merge

Posted by Ryan Rawson <ry...@gmail.com>.
Since the Merge tool works on an offline cluster, it goes straight to
the META HFiles, thus cannot be run in parallel.

It shouldn't be too hard to hack up Merge to work on an online
cluster, offline table.



On Thu, Feb 10, 2011 at 10:09 AM, Jean-Daniel Cryans
<jd...@apache.org> wrote:
> I think not, it opens and edits .META. so it would be like having
> multiple region servers serving it (which is always bad).
>
> J-D
>
> On Thu, Feb 10, 2011 at 5:22 AM, Sebastian Bauer <ad...@ugame.net.pl> wrote:
>> Hi, does anybody know whether "./bin/hbase org.apache.hadoop.hbase.util.Merge"
>> can run in parallel?
>>
>> Thanks,
>> Sebastian Bauer
>>
>

Re: question about org.apache.hadoop.hbase.util.Merge

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I think not, it opens and edits .META. so it would be like having
multiple region servers serving it (which is always bad).

J-D

On Thu, Feb 10, 2011 at 5:22 AM, Sebastian Bauer <ad...@ugame.net.pl> wrote:
> Hi, does anybody know whether "./bin/hbase org.apache.hadoop.hbase.util.Merge"
> can run in parallel?
>
> Thanks,
> Sebastian Bauer
>

question about org.apache.hadoop.hbase.util.Merge

Posted by Sebastian Bauer <ad...@ugame.net.pl>.
Hi, does anybody know whether "./bin/hbase
org.apache.hadoop.hbase.util.Merge" can run in parallel?

Thanks,
Sebastian Bauer

Re: getSplits question

Posted by Jean-Daniel Cryans <jd...@apache.org>.
There's the "split" command in the shel.

HBaseAdmin has that same method.

In the table's page from the master's web UI, there's a "split" button.

Finally, when creating a table, you can pre-specify all the split keys
with this method:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
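
A minimal sketch pulling the programmatic options together, assuming the
0.90-era client API; the table name "mytable", the family "cf", and the
split keys are placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // Pre-split at creation time: the table starts with one region per
    // key range instead of a single region.
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));
    byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
    admin.createTable(desc, splitKeys);

    // Or ask the master to split the regions of an existing table.
    admin.split("mytable");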

J-D

On Thu, Feb 10, 2011 at 8:48 AM, Geoff Hendrey <gh...@decarta.com> wrote:
> I hunted around for some info on how to force a table to split, but I
> didn't find what I was looking for. Is there a command I can issue from
> the HBase shell that would force every existing region to divide in
> half? That would be quite useful. If not, what's the next best way to
> force splits?
>
> thanks!
> -g
>
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Thursday, February 10, 2011 8:15 AM
> To: user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: getSplits question
>
>
> Ryan,
>
> Just to point out the obvious...
>
> On smaller tables where you don't get enough parallelism, you can
> manually force the table's regions to be split.
> My understanding is that if/when the table grows it will then go back to
> splitting normally.
>
> This way if you have a 'small' look up table that is relatively static,
> you manually split it to the 'right' size for your cloud.
> If you are seeding a system, you can do the splits to get good
> parallelism and not overload a single region with inserts, then let it
> go back to its normal growth pattern and splits.
>
> This would solve the OP's issue and as you point out, not worry about
> getSplits().
>
> Does this make sense, or am I missing something?
>
> -Mike
>
>> Date: Wed, 9 Feb 2011 23:54:19 -0800
>> Subject: Re: getSplits question
>> From: ryanobjc@gmail.com
>> To: user@hbase.apache.org
>> CC: hbase-user@hadoop.apache.org
>>
>> By default each map gets the contents of 1 region. A region is by
>> default a maximum of 256MB. There is no trivial way to generally
>> bisect a region in half, in terms of row count, given just what
>> we know (start and end keys).
>>
>> For very large tables that have > 100 regions, this algorithm works
>> really well and you get some good parallelism.  If you want to see a
>> lot of parallelism out of 1 region, you might have to work a lot
>> harder.  Or reduce your region size and have more regions.  Be warned,
>> though, that more regions have performance hits in other areas
>> (specifically server startup/shutdown/assignment times).  So you
>> probably don't want 50,000 32MB regions.
>>
>> -ryan
>>
>> On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <gh...@decarta.com>
> wrote:
>> > Oh, I definitely don't *need* my own to run mapreduce. However, if I
> want to control the number of records handled by each mapper (splitsize)
> and the startrow and endrow, then I thought I had to write my own
> getSplits(). Is there another way to accomplish this, because I do need
> the combination of controlled splitsize and start/endrow.
>> >
>> > -geoff
>> >
>> > -----Original Message-----
>> > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
>> > Sent: Wednesday, February 09, 2011 11:43 PM
>> > To: user@hbase.apache.org
>> > Cc: hbase-user@hadoop.apache.org
>> > Subject: Re: getSplits question
>> >
>> > You shouldn't need to write your own getSplits() method to run a map
>> > reduce, I never did at least...
>> >
>> > -ryan
>> >
>> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey
> <gh...@decarta.com> wrote:
>> >> Are endrows inclusive or exclusive? The docs say exclusive, but
> then the
>> >> question arises as to how to form the last split for getSplits().
> The
>> >> code below runs fine, but I believe it is omitting some rows,
> perhaps
>> >> b/c of the exclusive end row. For the final split, should the
> endrow be
>> >> null? I tried that, and got what appeared to be a final split
> without an
>> >> endrow at all. Would appreciate a pointer to the correct
> implementation
>> >> of getSplits in which I desire to provide a startrow, endrow, and
>> >> splitsize. Apparently this isn't it :-) :
>> >>
>> >>
>> >>
>> >> int splitSize = context.getConfiguration().getInt("splitsize",
> 1000);
>> >>
>> >>                byte[] splitStop = null;
>> >>
>> >>                String hostname = null;
>> >>
>> >>                while ((results =
> resultScanner.next(splitSize)).length
>> >>> 0) {
>> >>
>> >>                    //   System.out.println("results
>> >> :-------------------------- "+results);
>> >>
>> >>                    byte[] splitStart = results[0].getRow();
>> >>
>> >>                    splitStop = results[results.length -
> 1].getRow();
>> >> //I think this is a problem...we don't actually include this row in
> the
>> >> split since it's exclusive..revisit this and correct
>> >>
>> >>                    HRegionLocation location =
>> >> table.getRegionLocation(splitStart);
>> >>
>> >>                    hostname =
>> >> location.getServerAddress().getHostname();
>> >>
>> >>                    InputSplit split = new
>> >> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
>> >>
>> >>                    splits.add(split);
>> >>
>> >>                    System.out.println("initializing splits: " +
>> >> split.toString());
>> >>
>> >>                }
>> >>
>> >>                resultScanner.close();
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -g
>> >>
>> >>
>> >
>
>

RE: getSplits question

Posted by Geoff Hendrey <gh...@decarta.com>.
I hunted around for some info on how to force a table to split, but I
didn't find what I was looking for. Is there a command I can issue from
the HBase shell that would force every existing region to divide in
half? That would be quite useful. If not, what's the next best way to
force splits?

thanks!
-g

-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Thursday, February 10, 2011 8:15 AM
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: RE: getSplits question


Ryan,

Just to point out the obvious...

On smaller tables where you don't get enough parallelism, you can
manually force the table's regions to be split.
My understanding is that if/when the table grows it will then go back to
splitting normally. 

This way if you have a 'small' look up table that is relatively static,
you manually split it to the 'right' size for your cloud. 
If you are seeding a system, you can do the splits to get good
parallelism and not overload a single region with inserts, then let it
go back to its normal growth pattern and splits.

This would solve the OP's issue and as you point out, not worry about
getSplits().

Does this make sense, or am I missing something?

-Mike

> Date: Wed, 9 Feb 2011 23:54:19 -0800
> Subject: Re: getSplits question
> From: ryanobjc@gmail.com
> To: user@hbase.apache.org
> CC: hbase-user@hadoop.apache.org
> 
> By default each map gets the contents of 1 region. A region is by
> default a maximum of 256MB. There is no trivial way to generally
> bisect a region in half, in terms of row count, given just what
> we know (start and end keys).
> 
> For very large tables that have > 100 regions, this algorithm works
> really well and you get some good parallelism.  If you want to see a
> lot of parallelism out of 1 region, you might have to work a lot
> harder.  Or reduce your region size and have more regions.  Be warned,
> though, that more regions have performance hits in other areas
> (specifically server startup/shutdown/assignment times).  So you
> probably don't want 50,000 32MB regions.
> 
> -ryan
> 
> On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
> > Oh, I definitely don't *need* my own to run mapreduce. However, if I
want to control the number of records handled by each mapper (splitsize)
and the startrow and endrow, then I thought I had to write my own
getSplits(). Is there another way to accomplish this, because I do need
the combination of controlled splitsize and start/endrow.
> >
> > -geoff
> >
> > -----Original Message-----
> > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> > Sent: Wednesday, February 09, 2011 11:43 PM
> > To: user@hbase.apache.org
> > Cc: hbase-user@hadoop.apache.org
> > Subject: Re: getSplits question
> >
> > You shouldn't need to write your own getSplits() method to run a map
> > reduce, I never did at least...
> >
> > -ryan
> >
> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey
<gh...@decarta.com> wrote:
> >> Are endrows inclusive or exclusive? The docs say exclusive, but
then the
> >> question arises as to how to form the last split for getSplits().
The
> >> code below runs fine, but I believe it is omitting some rows,
perhaps
> >> b/c of the exclusive end row. For the final split, should the
endrow be
> >> null? I tried that, and got what appeared to be a final split
without an
> >> endrow at all. Would appreciate a pointer to the correct
implementation
> >> of getSplits in which I desire to provide a startrow, endrow, and
> >> splitsize. Apparently this isn't it :-) :
> >>
> >>
> >>
> >> int splitSize = context.getConfiguration().getInt("splitsize",
1000);
> >>
> >>                byte[] splitStop = null;
> >>
> >>                String hostname = null;
> >>
> >>                while ((results =
resultScanner.next(splitSize)).length
> >>> 0) {
> >>
> >>                    //   System.out.println("results
> >> :-------------------------- "+results);
> >>
> >>                    byte[] splitStart = results[0].getRow();
> >>
> >>                    splitStop = results[results.length -
1].getRow();
> >> //I think this is a problem...we don't actually include this row in
the
> >> split since it's exclusive..revisit this and correct
> >>
> >>                    HRegionLocation location =
> >> table.getRegionLocation(splitStart);
> >>
> >>                    hostname =
> >> location.getServerAddress().getHostname();
> >>
> >>                    InputSplit split = new
> >> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
> >>
> >>                    splits.add(split);
> >>
> >>                    System.out.println("initializing splits: " +
> >> split.toString());
> >>
> >>                }
> >>
> >>                resultScanner.close();
> >>
> >>
> >>
> >>
> >>
> >> -g
> >>
> >>
> >
 		 	   		  

Re: RE: getSplits question

Posted by Ryan Rawson <ry...@gmail.com>.
Yep, you're right on there.
On Feb 10, 2011 8:15 AM, "Michael Segel" <mi...@hotmail.com> wrote:
>
> Ryan,
>
> Just to point out the obvious...
>
> On smaller tables where you don't get enough parallelism, you can manually
force the table's regions to be split.
> My understanding is that if/when the table grows it will then go back to
splitting normally.
>
> This way if you have a 'small' look up table that is relatively static,
you manually split it to the 'right' size for your cloud.
> If you are seeding a system, you can do the splits to get good parallelism
and not overload a single region with inserts, then let it go back to its
normal growth pattern and splits.
>
> This would solve the OP's issue and as you point out, not worry about
getSplits().
>
> Does this make sense, or am I missing something?
>
> -Mike
>
>> Date: Wed, 9 Feb 2011 23:54:19 -0800
>> Subject: Re: getSplits question
>> From: ryanobjc@gmail.com
>> To: user@hbase.apache.org
>> CC: hbase-user@hadoop.apache.org
>>
>> By default each map gets the contents of 1 region. A region is by
>> default a maximum of 256MB. There is no trivial way to generally
>> bisect a region in half, in terms of row count, given just what
>> we know (start and end keys).
>>
>> For very large tables that have > 100 regions, this algorithm works
>> really well and you get some good parallelism. If you want to see a
>> lot of parallelism out of 1 region, you might have to work a lot
>> harder. Or reduce your region size and have more regions. Be warned,
>> though, that more regions have performance hits in other areas
>> (specifically server startup/shutdown/assignment times). So you
>> probably don't want 50,000 32MB regions.
>>
>> -ryan
>>
>> On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
>> > Oh, I definitely don't *need* my own to run mapreduce. However, if I
want to control the number of records handled by each mapper (splitsize) and
the startrow and endrow, then I thought I had to write my own getSplits().
Is there another way to accomplish this, because I do need the combination
of controlled splitsize and start/endrow.
>> >
>> > -geoff
>> >
>> > -----Original Message-----
>> > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
>> > Sent: Wednesday, February 09, 2011 11:43 PM
>> > To: user@hbase.apache.org
>> > Cc: hbase-user@hadoop.apache.org
>> > Subject: Re: getSplits question
>> >
>> > You shouldn't need to write your own getSplits() method to run a map
>> > reduce, I never did at least...
>> >
>> > -ryan
>> >
>> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
>> >> Are endrows inclusive or exclusive? The docs say exclusive, but then
the
>> >> question arises as to how to form the last split for getSplits(). The
>> >> code below runs fine, but I believe it is omitting some rows, perhaps
>> >> b/c of the exclusive end row. For the final split, should the endrow
be
>> >> null? I tried that, and got what appeared to be a final split without
an
>> >> endrow at all. Would appreciate a pointer to the correct
implementation
>> >> of getSplits in which I desire to provide a startrow, endrow, and
>> >> splitsize. Apparently this isn't it :-) :
>> >>
>> >>
>> >>
>> >> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>> >>
>> >> byte[] splitStop = null;
>> >>
>> >> String hostname = null;
>> >>
>> >> while ((results = resultScanner.next(splitSize)).length
>> >>> 0) {
>> >>
>> >> // System.out.println("results
>> >> :-------------------------- "+results);
>> >>
>> >> byte[] splitStart = results[0].getRow();
>> >>
>> >> splitStop = results[results.length - 1].getRow();
>> >> //I think this is a problem...we don't actually include this row in
the
>> >> split since it's exclusive..revisit this and correct
>> >>
>> >> HRegionLocation location =
>> >> table.getRegionLocation(splitStart);
>> >>
>> >> hostname =
>> >> location.getServerAddress().getHostname();
>> >>
>> >> InputSplit split = new
>> >> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
>> >>
>> >> splits.add(split);
>> >>
>> >> System.out.println("initializing splits: " +
>> >> split.toString());
>> >>
>> >> }
>> >>
>> >> resultScanner.close();
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -g
>> >>
>> >>
>> >
>

RE: getSplits question

Posted by Michael Segel <mi...@hotmail.com>.
Ryan,

Just to point out the obvious...

On smaller tables where you don't get enough parallelism, you can manually force the table's regions to be split.
My understanding is that if/when the table grows it will then go back to splitting normally.

This way if you have a 'small' look up table that is relatively static, you manually split it to the 'right' size for your cloud. 
If you are seeding a system, you can do the splits to get good parallelism and not overload a single region with inserts, then let it go back to its normal growth pattern and splits.

This would solve the OP's issue and as you point out, not worry about getSplits().

Does this make sense, or am I missing something?

-Mike

> Date: Wed, 9 Feb 2011 23:54:19 -0800
> Subject: Re: getSplits question
> From: ryanobjc@gmail.com
> To: user@hbase.apache.org
> CC: hbase-user@hadoop.apache.org
> 
> By default each map gets the contents of 1 region. A region is by
> default a maximum of 256MB. There is no trivial way to generally
> bisect a region in half, in terms of row count, given just what
> we know (start and end keys).
> 
> For very large tables that have > 100 regions, this algorithm works
> really well and you get some good parallelism.  If you want to see a
> lot of parallelism out of 1 region, you might have to work a lot
> harder.  Or reduce your region size and have more regions.  Be warned,
> though, that more regions have performance hits in other areas
> (specifically server startup/shutdown/assignment times).  So you
> probably don't want 50,000 32MB regions.
> 
> -ryan
> 
> On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> > Oh, I definitely don't *need* my own to run mapreduce. However, if I want to control the number of records handled by each mapper (splitsize) and the startrow and endrow, then I thought I had to write my own getSplits(). Is there another way to accomplish this, because I do need the combination of controlled splitsize and start/endrow.
> >
> > -geoff
> >
> > -----Original Message-----
> > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> > Sent: Wednesday, February 09, 2011 11:43 PM
> > To: user@hbase.apache.org
> > Cc: hbase-user@hadoop.apache.org
> > Subject: Re: getSplits question
> >
> > You shouldn't need to write your own getSplits() method to run a map
> > reduce, I never did at least...
> >
> > -ryan
> >
> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> >> Are endrows inclusive or exclusive? The docs say exclusive, but then the
> >> question arises as to how to form the last split for getSplits(). The
> >> code below runs fine, but I believe it is omitting some rows, perhaps
> >> b/c of the exclusive end row. For the final split, should the endrow be
> >> null? I tried that, and got what appeared to be a final split without an
> >> endrow at all. Would appreciate a pointer to the correct implementation
> >> of getSplits in which I desire to provide a startrow, endrow, and
> >> splitsize. Apparently this isn't it :-) :
> >>
> >>
> >>
> >> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
> >>
> >>                byte[] splitStop = null;
> >>
> >>                String hostname = null;
> >>
> >>                while ((results = resultScanner.next(splitSize)).length
> >>> 0) {
> >>
> >>                    //   System.out.println("results
> >> :-------------------------- "+results);
> >>
> >>                    byte[] splitStart = results[0].getRow();
> >>
> >>                    splitStop = results[results.length - 1].getRow();
> >> //I think this is a problem...we don't actually include this row in the
> >> split since it's exclusive..revisit this and correct
> >>
> >>                    HRegionLocation location =
> >> table.getRegionLocation(splitStart);
> >>
> >>                    hostname =
> >> location.getServerAddress().getHostname();
> >>
> >>                    InputSplit split = new
> >> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
> >>
> >>                    splits.add(split);
> >>
> >>                    System.out.println("initializing splits: " +
> >> split.toString());
> >>
> >>                }
> >>
> >>                resultScanner.close();
> >>
> >>
> >>
> >>
> >>
> >> -g
> >>
> >>
> >
 		 	   		  

Re: getSplits question

Posted by Ryan Rawson <ry...@gmail.com>.
By default each map gets the contents of 1 region. A region is by
default a maximum of 256MB. There is no trivial way to generally
bisect a region in half, in terms of row count, given just what
we know (start and end keys).

For very large tables that have > 100 regions, this algorithm works
really well and you get some good parallelism.  If you want to see a
lot of parallelism out of 1 region, you might have to work a lot
harder.  Or reduce your region size and have more regions.  Be warned,
though, that more regions have performance hits in other areas
(specifically server startup/shutdown/assignment times).  So you
probably don't want 50,000 32MB regions.
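
For the region-size lever mentioned above, a minimal sketch assuming the
era's property name (hbase.hregion.max.filesize, a byte count):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Lowering the max region size yields more regions, hence more map
    // tasks, at the cost of slower startup/shutdown/assignment.
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.hregion.max.filesize", 128L * 1024 * 1024); // 128MB vs. the 256MB default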

-ryan

On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> Oh, I definitely don't *need* my own to run mapreduce. However, if I want to control the number of records handled by each mapper (splitsize) and the startrow and endrow, then I thought I had to write my own getSplits(). Is there another way to accomplish this, because I do need the combination of controlled splitsize and start/endrow.
>
> -geoff
>
> -----Original Message-----
> From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> Sent: Wednesday, February 09, 2011 11:43 PM
> To: user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: Re: getSplits question
>
> You shouldn't need to write your own getSplits() method to run a map
> reduce, I never did at least...
>
> -ryan
>
> On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <gh...@decarta.com> wrote:
>> Are endrows inclusive or exclusive? The docs say exclusive, but then the
>> question arises as to how to form the last split for getSplits(). The
>> code below runs fine, but I believe it is omitting some rows, perhaps
>> b/c of the exclusive end row. For the final split, should the endrow be
>> null? I tried that, and got what appeared to be a final split without an
>> endrow at all. Would appreciate a pointer to the correct implementation
>> of getSplits in which I desire to provide a startrow, endrow, and
>> splitsize. Apparently this isn't it :-) :
>>
>>
>>
>> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>>
>>                byte[] splitStop = null;
>>
>>                String hostname = null;
>>
>>                while ((results = resultScanner.next(splitSize)).length
>>> 0) {
>>
>>                    //   System.out.println("results
>> :-------------------------- "+results);
>>
>>                    byte[] splitStart = results[0].getRow();
>>
>>                    splitStop = results[results.length - 1].getRow();
>> //I think this is a problem...we don't actually include this row in the
>> split since it's exclusive..revisit this and correct
>>
>>                    HRegionLocation location =
>> table.getRegionLocation(splitStart);
>>
>>                    hostname =
>> location.getServerAddress().getHostname();
>>
>>                    InputSplit split = new
>> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
>>
>>                    splits.add(split);
>>
>>                    System.out.println("initializing splits: " +
>> split.toString());
>>
>>                }
>>
>>                resultScanner.close();
>>
>>
>>
>>
>>
>> -g
>>
>>
>

RE: getSplits question

Posted by Geoff Hendrey <gh...@decarta.com>.
Oh, I definitely don't *need* my own to run mapreduce. However, if I want to control the number of records handled by each mapper (splitsize) and the startrow and endrow, then I thought I had to write my own getSplits(). Is there another way to accomplish this? I do need the combination of controlled splitsize and start/endrow.

-geoff
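
For the start/endrow half (though not the per-split record count), a
sketch using the stock TableInputFormat, assuming the 0.90-era mapreduce
API; the table name and row keys are placeholders, and IdentityTableMapper
stands in for the real mapper:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "range-scan");

    // Restrict the whole job to [startrow, endrow); the stop row is exclusive.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("startrow"));
    scan.setStopRow(Bytes.toBytes("endrow"));

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        IdentityTableMapper.class, ImmutableBytesWritable.class, Result.class, job);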

-----Original Message-----
From: Ryan Rawson [mailto:ryanobjc@gmail.com] 
Sent: Wednesday, February 09, 2011 11:43 PM
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: Re: getSplits question

You shouldn't need to write your own getSplits() method to run a map
reduce, I never did at least...

-ryan

On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> Are endrows inclusive or exclusive? The docs say exclusive, but then the
> question arises as to how to form the last split for getSplits(). The
> code below runs fine, but I believe it is omitting some rows, perhaps
> b/c of the exclusive end row. For the final split, should the endrow be
> null? I tried that, and got what appeared to be a final split without an
> endrow at all. Would appreciate a pointer to the correct implementation
> of getSplits in which I desire to provide a startrow, endrow, and
> splitsize. Apparently this isn't it :-) :
>
>
>
> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>
>                byte[] splitStop = null;
>
>                String hostname = null;
>
>                while ((results = resultScanner.next(splitSize)).length
>> 0) {
>
>                    //   System.out.println("results
> :-------------------------- "+results);
>
>                    byte[] splitStart = results[0].getRow();
>
>                    splitStop = results[results.length - 1].getRow();
> //I think this is a problem...we don't actually include this row in the
> split since it's exclusive..revisit this and correct
>
>                    HRegionLocation location =
> table.getRegionLocation(splitStart);
>
>                    hostname =
> location.getServerAddress().getHostname();
>
>                    InputSplit split = new
> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
>
>                    splits.add(split);
>
>                    System.out.println("initializing splits: " +
> split.toString());
>
>                }
>
>                resultScanner.close();
>
>
>
>
>
> -g
>
>

Re: getSplits question

Posted by Ryan Rawson <ry...@gmail.com>.
You shouldn't need to write your own getSplits() method to run a map
reduce, I never did at least...

-ryan

On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> Are endrows inclusive or exclusive? The docs say exclusive, but then the
> question arises as to how to form the last split for getSplits(). The
> code below runs fine, but I believe it is omitting some rows, perhaps
> b/c of the exclusive end row. For the final split, should the endrow be
> null? I tried that, and got what appeared to be a final split without an
> endrow at all. Would appreciate a pointer to the correct implementation
> of getSplits in which I desire to provide a startrow, endrow, and
> splitsize. Apparently this isn't it :-) :
>
>
>
> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>
>                byte[] splitStop = null;
>
>                String hostname = null;
>
>                while ((results = resultScanner.next(splitSize)).length
>> 0) {
>
>                    //   System.out.println("results
> :-------------------------- "+results);
>
>                    byte[] splitStart = results[0].getRow();
>
>                    splitStop = results[results.length - 1].getRow();
> //I think this is a problem...we don't actually include this row in the
> split since it's exclusive..revisit this and correct
>
>                    HRegionLocation location =
> table.getRegionLocation(splitStart);
>
>                    hostname =
> location.getServerAddress().getHostname();
>
>                    InputSplit split = new
> TableSplit(table.getTableName(), splitStart, splitStop, hostname);
>
>                    splits.add(split);
>
>                    System.out.println("initializing splits: " +
> split.toString());
>
>                }
>
>                resultScanner.close();
>
>
>
>
>
> -g
>
>