Posted to user@hbase.apache.org by Serega Sheypak <se...@gmail.com> on 2014/10/02 21:03:51 UTC

BulkLoad 200GB table with one region. Is it OK?

Hi, I'm doing an HBase bulk load into an empty table.
The input data size is 200GB.
Is it OK to load the data into one default region and then wait while HBase
splits the 200GB region?

I don't have any SLA for the initial load. I can wait until HBase splits
the initial load files.
This table is READ only.

The only consideration is to not affect other tables and to not cause
HBase cluster degradation.

Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Even if you have 100 files, HBase will still need to read them to split them.
Each file might contain keys for both regions, so HBase will read 200GB
and write 100GB on each side.

Lastly, I don't think the max file size has any impact on the bulk load
side. It's the way you generate your files that matters. Can you take a
look at your output folder?

JM


Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Serega Sheypak <se...@gmail.com>.
There are several files generated. I suppose there are 20 files, because
HBase is configured to have 10GB files.
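For reference, the 10GB figure mentioned above would come from the
`hbase.hregion.max.filesize` property (assuming that is the setting this
cluster uses for its region size limit); in `hbase-site.xml` it would look
roughly like:

```xml
<!-- hbase-site.xml: maximum size a region may grow to before HBase
     considers splitting it (10GB, expressed in bytes). -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>
```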

Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
If it's a single 200GB file, then when HBase splits this region, the file
will have to be split and re-written into 2 x 100GB files.

How is the file generated? You should really think about splitting it
first...


Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Jerry He <je...@gmail.com>.
The reference files will be rewritten during compaction, which normally
happens right after splits.

You did not mention if your 200GB of data is one file, or many HFiles.

Jerry

Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Serega Sheypak <se...@gmail.com>.
Sorry, I meant massive IO.
This table is read-only, so HBase should just place reference files. Why
would HBase rewrite the files?


Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Serega Sheypak <se...@gmail.com>.
Hi!
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
It says that splitting is just placing a 'reference' file.
Why should there be massive splitting?


Re: BulkLoad 200GB table with one region. Is it OK?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Serega,

Bulk load just "pushes" the files into an HBase region, so there should not
be any issue there. The splits, however, might take some time, because HBase
will have to split the region again and again until it becomes small enough.
So if your max file size is 10GB, it will split the 200GB into 100GB, then
50GB, then 25GB, then 12GB, then 6GB... Each time, everything is re-written:
a LOT of wasted IO.
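To make that cost concrete, here is a back-of-the-envelope sketch (plain
Python, not an HBase API) of how much data gets rewritten if a single region
is repeatedly split in half, with a compaction rewriting the data after each
round of splits:

```python
def rewrite_cost_gb(total_gb: float, max_region_gb: float) -> float:
    """Total GB rewritten when one region holding total_gb is split in
    half again and again until every region is <= max_region_gb, and a
    compaction rewrites the data after each round of splits."""
    rewritten = 0.0
    region_size = total_gb
    while region_size > max_region_gb:
        rewritten += total_gb      # each round rewrites the whole data set
        region_size /= 2
    return rewritten

# 200GB loaded into one region, 10GB max region size:
print(rewrite_cost_gb(200, 10))    # 1000.0 -> roughly 1TB of extra IO
```

Under these assumptions, the 200GB single-region load costs about five times
the data size in extra rewrites before the regions settle.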

So the answer is: yes, HBase can handle it, BUT it's not good practice. It's
better to pre-split the table first and generate the bulk files based on the
split regions. Otherwise, it might also affect the other tables, because
HBase will have to do massive IO, which in the end might impact their
performance.

JM
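A minimal sketch of the pre-split approach JM recommends: compute evenly
spaced split keys up front, then create the table with them (for instance
via the shell's `SPLITS` option or `Admin.createTable` with split keys) and
partition the bulk-load output the same way. The uniformly distributed
2-byte hashed key prefix below is a hypothetical key design, not something
from this thread:

```python
def split_keys(num_regions: int, key_width: int = 2) -> list[bytes]:
    """Evenly spaced split keys for num_regions regions, assuming row
    keys start with a uniformly distributed key_width-byte hash prefix
    (hypothetical key design; adapt to your real key distribution)."""
    key_space = 256 ** key_width
    step = key_space // num_regions
    # num_regions regions need num_regions - 1 split points
    return [(i * step).to_bytes(key_width, "big")
            for i in range(1, num_regions)]

keys = split_keys(20)        # 20 regions -> 19 split points
print(len(keys))             # 19
print(keys[0].hex())         # 0ccc
```

With 20 pre-created regions of ~10GB each, the bulk-loaded HFiles land
directly in their final regions and the repeated split-and-rewrite cycle is
avoided entirely.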
