Posted to user@hbase.apache.org by Michael <mi...@fantasymail.de> on 2019/07/18 08:39:42 UTC

HBase 2, bulk import question

Hi,

I looked into the possibility of bulk importing into HBase, but somehow I
don't get it. I am not able to pre-split the data, so does bulk importing
work without pre-splitting?
As I understand it, instead of putting the data, I create the HBase
region files (HFiles) myself, but all the tutorials I read mention pre-splitting...

So, is pre-splitting essential for bulk importing?

It would be really helpful if someone could point me to a demo
implementation of a bulk import.

Thanks for helping
 Michael



Re: HBase 2, bulk import question

Posted by OpenInx <op...@gmail.com>.
> To add to that, the split will be done on the master,
It's done locally, not on the master: the LoadIncrementalHFiles tool splits an
HFile locally if it finds that the file crosses two or more regions.


Re: HBase 2, bulk import question

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
+1 to that last statement. (I think the split is done locally where you run
the command, not on the master, but I could be wrong.) That means if you have
a single giant file and 200 regions, it will require a lot of non-distributed
work...


Re: HBase 2, bulk import question

Posted by Austin Heyne <ah...@ccri.com>.
To add to that, the split will be done on the master, so if you 
anticipate a lot of splits it can be an issue.

-Austin


Re: HBase 2, bulk import question

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
One thing to add: when you bulkload your files, they will, if needed, be
split according to the region boundaries.

Because between when you start your job and when you push your files, there
might have been some "natural" splits on the table side, so the bulkloader
has to be able to re-split your generated data.

JMS


Re: HBase 2, bulk import question

Posted by OpenInx <op...@gmail.com>.
Austin is right. Pre-splitting mainly matters for generating and loading the
HFiles: when you bulkload, each generated HFile is loaded into the
corresponding region whose rowkey range covers the HFile's keys. Without
pre-splitting, all HFiles end up in a single region, the bulkload becomes
time-consuming, and that one region easily becomes a hotspot once queries
come in.
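
As a rough illustration of what that pre-splitting looks like with the HBase 2
Java client (the table name "bulk_demo", the column family "cf" and the split
keys below are made-up values for this sketch, not anything from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Split keys define the region boundaries; with 4 keys the table starts
      // out with 5 regions instead of 1, so the generated HFiles map onto
      // several regions and the load is spread out.
      byte[][] splitKeys = new byte[][] {
          Bytes.toBytes("20000000"),
          Bytes.toBytes("40000000"),
          Bytes.toBytes("60000000"),
          Bytes.toBytes("80000000")
      };
      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("bulk_demo"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
              .build(),
          splitKeys);
    }
  }
}

The split keys of course have to match how your rowkeys are actually
distributed; choosing them well is the part you have to get right up front.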

For a demo, you can look here:
[1]. https://hbase.apache.org/book.html#arch.bulk.load
[2].
http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/

Thanks.


Re: HBase 2, bulk import question

Posted by Austin Heyne <ah...@ccri.com>.
Bulk importing requires that the table the data is being bulk imported into
already exists. This is because the MapReduce job needs to extract the
region start/end keys in order to drive the reducers. This means you need
to create your table beforehand with the appropriate pre-splits, then run
your bulk ingest and bulk load to get the data into the table. If you do
not pre-split your table, you end up with a single reducer in your bulk
ingest job. This also means that your bulk ingest cluster will need to be
able to communicate with your HBase instance.
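
As a hedged sketch of those two steps against such a pre-existing table (the
table name "bulk_demo", column family "cf", the paths and the toy CSV mapper
are all invented for the example; HFileOutputFormat2 and LoadIncrementalHFiles
are the HBase 2 classes involved, see the reference guide link earlier in this
thread for the authoritative recipe):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoadSketch {

  // Toy mapper: assumes each input line is "rowkey,value" and writes it as cf:v.
  public static class CsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);
      if (parts.length < 2) {
        return;  // skip malformed lines in this sketch
      }
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("bulk_demo");  // must already exist, pre-split

    // Step 1: bulk ingest - a MapReduce job that writes HFiles instead of doing Puts.
    // configureIncrementalLoad() reads the table's current region boundaries and sets
    // up a total-order partitioner with one reducer per region, which is why the table
    // has to exist (and be pre-split) before the job runs.
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {
      Job job = Job.getInstance(conf, "bulk-ingest-demo");
      job.setJarByClass(BulkLoadSketch.class);
      job.setMapperClass(CsvToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, new Path("/input/raw-data"));    // placeholder
      FileOutputFormat.setOutputPath(job, new Path("/output/hfiles"));   // placeholder
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
    }

    // Step 2: bulk load - hand the generated HFiles to the running cluster. This is
    // the step that re-splits an HFile locally if it now crosses a region boundary.
    // It is the programmatic equivalent of:
    //   hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles /output/hfiles bulk_demo
    System.exit(ToolRunner.run(conf,
        new LoadIncrementalHFiles(conf),
        new String[] { "/output/hfiles", "bulk_demo" }));
  }
}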

-Austin
