Posted to user@hbase.apache.org by David Alves <dr...@criticalsoftware.com> on 2008/07/31 15:06:27 UTC

Region Splits

Hi Guys

	I use hbase (amongst other things) to crawl some repos of information
and until now I've been using the Nutch segment generation paradigm. I
would very much like to skip the segment generation step using hbase as
source and sink directly but in order to do that I would need to either
allow more than one split to be generated for a single region or make
the regions in this particular table split with far fewer entries than
other tables.
	Is any of this possible?

Regards
David Alves

PS: Thanks Jim and Stack for your hard work on this great piece of
software. Hoping to see you guys committing again soon, but either way
you have already done great work.

Re: Region Splits

Posted by David Alves <dr...@criticalsoftware.com>.

Hi JD
	Thanks for the reply.
	Regarding the filesize parameter, if I'm not mistaken it applies to
all tables, right? Or can we configure it for a specific table? I ask
because this is actually a table with a lot of entries, but each entry is
very small, so if I set the filesize param for the number of entries I
need (around 1.5K entries) and it also applies to other tables, there are
some that would create a region per entry :).
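
To make that concern concrete, here is a rough sketch of the arithmetic. It is only an illustration: the average entry sizes (200 bytes for the small-entry table, 400 KB for a large-entry table) and the class and method names are made up for the example, and real splits depend on store file size on disk rather than a simple row count.

```java
// Hypothetical helper for illustration only; not part of HBase.
final class SplitThresholdEstimate {

    /** Rough max-filesize threshold (bytes) so a region splits after about targetEntries rows. */
    public static long thresholdBytes(long targetEntries, long avgEntryBytes) {
        return Math.multiplyExact(targetEntries, avgEntryBytes);
    }

    /** Approximate number of entries a region holds before reaching the threshold (at least 1). */
    public static long entriesPerRegion(long thresholdBytes, long avgEntryBytes) {
        return Math.max(1L, thresholdBytes / avgEntryBytes);
    }

    public static void main(String[] args) {
        long smallEntryBytes = 200L;          // assumed size of a "very small" entry
        long largeEntryBytes = 400L * 1024L;  // assumed size of an entry in another table

        // Threshold sized for about 1.5K small entries.
        long threshold = thresholdBytes(1500L, smallEntryBytes);
        System.out.println(threshold);                                    // 300000

        // The same threshold applied globally to a large-entry table.
        System.out.println(entriesPerRegion(threshold, largeEntryBytes)); // 1
    }
}
```

With these assumed sizes, a threshold tuned for 1.5K small entries is smaller than a single large entry, which is the "region per entry" case.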
	Regarding the other option, I'm glad to try and implement it but would
appreciate any guidance. It would definitely need a row counter and a
means of getting the nth row. I seem to recall a JIRA about the counter
but don't know about the second issue. Any thoughts?

David



On Thu, 2008-07-31 at 09:49 -0400, Jean-Daniel Cryans wrote:
> David,
> 
> If having regions splitting below the default threshold is what you want,
> you can change the configuration parameter "hbase.hregion.max.filesize"
> which by default is set to 256M. Regarding the other option, I don't think
> it's easily doable.
> 
> J-D

Re: Region Splits

Posted by Jean-Daniel Cryans <jd...@gmail.com>.

David,

If having regions splitting below the default threshold is what you want,
you can change the configuration parameter "hbase.hregion.max.filesize"
which by default is set to 256M. Regarding the other option, I don't think
it's easily doable.

J-D

On Thu, Jul 31, 2008 at 9:06 AM, David Alves
<dr...@criticalsoftware.com>wrote:

> Hi Guys
>
>        I use hbase (amongst other things) to crawl some repos of information
> and until now I've been using the Nutch segment generation paradigm. I
> would very much like to skip the segment generation step using hbase as
> source and sink directly but in order to do that I would need to either
> allow more than one split to be generated for a single region or make
> the regions in this particular table split with far fewer entries than
> other tables.
>        Is any of this possible?
>
> Regards
> David Alves
>
> PS: Thanks Jim and Stack for your hard work on this great piece of
> software. Hoping to see you guys committing again soon, but either way
> you have already done great work.
>
>

Re: Region Splits

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

Good to hear

Thanks

Billy

"Andrew Purtell" <ap...@yahoo.com> wrote in 
message news:148502.78016.qm@web65506.mail.ac4.yahoo.com...
> All options like that are implemented as HCD or HTD attributes,
> so a general get/set attribute interface will cover everything
> (including user attributes).
>
>   - Andy

Re: Region Splits

Posted by Andrew Purtell <ap...@yahoo.com>.

All options like that are implemented as HCD (HColumnDescriptor) or
HTD (HTableDescriptor) attributes, so a general get/set attribute
interface will cover everything (including user attributes).
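
As a minimal sketch of what such a general get/set attribute interface could look like (this is not HBase's actual implementation; the class name AttributeBag, the keys "MAX_FILESIZE" and "READONLY", and all method names here are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a descriptor; not the real HTableDescriptor.
final class AttributeBag {
    // Assumed attribute keys, chosen for this example only.
    public static final String MAX_FILESIZE = "MAX_FILESIZE";
    public static final String READONLY = "READONLY";

    private final Map<String, String> values = new HashMap<>();

    // General-purpose accessors: built-in options and user attributes alike go through these.
    public void setValue(String key, String value) { values.put(key, value); }
    public String getValue(String key) { return values.get(key); }

    // Typed helpers are thin wrappers over the general interface.
    public long getMaxFileSize(long defaultBytes) {
        String v = getValue(MAX_FILESIZE);
        return v == null ? defaultBytes : Long.parseLong(v);
    }

    public boolean isReadOnly() {
        // parseBoolean(null) is false, so an unset attribute means read-write.
        return Boolean.parseBoolean(getValue(READONLY));
    }

    public static void main(String[] args) {
        AttributeBag table = new AttributeBag();
        table.setValue(MAX_FILESIZE, "8388608");   // split threshold
        table.setValue(READONLY, "true");          // read-only flag
        table.setValue("owner", "crawler-team");   // arbitrary user attribute
        System.out.println(table.getMaxFileSize(268435456L)); // 8388608
        System.out.println(table.isReadOnly());                // true
        System.out.println(table.getValue("owner"));           // crawler-team
    }
}
```

The point of the design is that a shell, Thrift, or REST front end would only need to expose the two general calls to reach every option.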

   - Andy

> From: Billy Pearson <sa...@pearsonwholesale.com>
> Subject: Re: Region Splits
> To: hbase-user@hadoop.apache.org
> Date: Wednesday, August 6, 2008, 11:18 PM
> Is HBASE-800 going to cover the read-only option for the shell
> also, or should I open a ticket for that too?
> 
> Billy
> 
> "Andrew Purtell" <ap...@yahoo.com> wrote in 
> message news:108405.53260.qm@web65504.mail.ac4.yahoo.com...
> > Good suggestion.
> >
> > Created HBASE-800.
> >
> >   - Andy

Re: Region Splits

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

Is HBASE-800 going to cover the read-only option for the shell also, or
should I open a ticket for that too?

Billy



"Andrew Purtell" <ap...@yahoo.com> wrote in 
message news:108405.53260.qm@web65504.mail.ac4.yahoo.com...
> Good suggestion.
>
> Created HBASE-800.
>
>   - Andy

Re: Region Splits

Posted by Andrew Purtell <ap...@yahoo.com>.

Good suggestion.

Created HBASE-800.

   - Andy

> From: Billy Pearson <sa...@pearsonwholesale.com>
> Subject: Re: Region Splits
> To: hbase-user@hadoop.apache.org
> Date: Wednesday, August 6, 2008, 4:33 PM
> Hey Andrew
> Do we have plans to include setMaxFileSize for the
> shell, Thrift, and REST?
> 
> So non-Java users can change this as needed without having
> to learn Java.
> 
> Billy

Re: Region Splits

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

Hey Andrew
Do we have plans to include setMaxFileSize for the shell, Thrift, and REST?

So non-Java users can change this as needed without having to learn Java.

Billy

"Andrew Purtell" <ap...@yahoo.com> wrote in 
message news:189371.9860.qm@web65516.mail.ac4.yahoo.com...
> Hello David,
>
> Current trunk (upcoming 0.2.0) has support for per-table metadata. See 
> https://issues.apache.org/jira/browse/HBASE-42 and 
> https://issues.apache.org/jira/browse/HBASE-62.
>
> So maybe you can set the split threshold quite low for the table in 
> question?
>
> The default is 256MB (268435456), set globally for all tables in the HBase 
> configuration as "hbase.hregion.max.filesize". However it's reasonable to 
> set it as low as the DFS blocksize. The guidance for a typical HBase 
> installation is to set the DFS blocksize to 8MB (8388608), instead of the 
> default 64MB.
>
> At create time:
>
>  HTableDescriptor htd = new HTableDescriptor("foo");
>  htd.setMaxFileSize(8388608);
>  ...
>  HBaseAdmin admin = new HBaseAdmin(hconf);
>  admin.createTable(htd);
>
> If the table already exists:
>
>  HTable table = new HTable(hconf, "foo");
>  admin.disableTable("foo");
>  // make a read-write descriptor
>  HTableDescriptor htd =
>    new HTableDescriptor(table.getTableDescriptor());
>  htd.setMaxFileSize(8388608);
>  admin.modifyTableMeta("foo", htd);
>  admin.enableTable("foo");
>
> Hope this helps,
>
>   - Andy

Re: Region Splits

Posted by Andrew Purtell <ap...@yahoo.com>.

Hello David,

Current trunk (upcoming 0.2.0) has support for per-table metadata. See https://issues.apache.org/jira/browse/HBASE-42 and https://issues.apache.org/jira/browse/HBASE-62. 

So maybe you can set the split threshold quite low for the table in question?

The default is 256MB (268435456), set globally for all tables in the HBase configuration as "hbase.hregion.max.filesize". However, it's reasonable to set it as low as the DFS blocksize. The guidance for a typical HBase installation is to set the DFS blocksize to 8MB (8388608), instead of the default 64MB.
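
The byte counts quoted above follow from 1MB = 1024 * 1024 bytes. A quick check, with a class name made up for the illustration:

```java
// Hypothetical helper for illustration only; not part of HBase.
final class ByteSizes {
    public static final long MB = 1024L * 1024L;

    /** Converts a size in megabytes to bytes, failing loudly on overflow. */
    public static long megabytes(long n) {
        return Math.multiplyExact(n, MB);
    }

    public static void main(String[] args) {
        System.out.println(megabytes(256)); // 268435456, the global default split threshold
        System.out.println(megabytes(64));  // 67108864, the default DFS blocksize
        System.out.println(megabytes(8));   // 8388608, the suggested DFS blocksize
    }
}
```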

At create time:

  HTableDescriptor htd = new HTableDescriptor("foo");
  htd.setMaxFileSize(8388608);
  ...
  HBaseAdmin admin = new HBaseAdmin(hconf);
  admin.createTable(htd);

If the table already exists:

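  // hconf is the same configuration object used at create time
  HBaseAdmin admin = new HBaseAdmin(hconf);
  HTable table = new HTable(hconf, "foo");
  // take the table offline before changing its metadata
  admin.disableTable("foo");
  // make a read-write descriptor
  HTableDescriptor htd =
    new HTableDescriptor(table.getTableDescriptor());
  htd.setMaxFileSize(8388608);  // 8MB
  admin.modifyTableMeta("foo", htd);
  admin.enableTable("foo");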

Hope this helps, 

   - Andy

> From: David Alves <dr...@criticalsoftware.com>
> Subject: Region Splits
> To: "hbase-user@hadoop.apache.org" <hb...@hadoop.apache.org>
> Date: Thursday, July 31, 2008, 6:06 AM
[...]
> I use hbase (amongst other things) to crawl some repos of information
> and until now I've been using the Nutch segment generation paradigm.
> I would very much like to skip the segment generation step using
> hbase as source and sink directly but in order to do that I would
> need to either allow more than one split to be generated for a
> single region or make the regions in this particular table split
> with far fewer entries than other tables.
[...]