Posted to user@hbase.apache.org by Rita <rm...@gmail.com> on 2012/08/04 04:43:21 UTC

adding data

I have a file which has 13 billion key/value rows which I would like to
load into HBase. I was wondering if anyone has a good MapReduce example
for this sort of work.


tia


-- 
--- Get your facts first, then you can distort them as you please.--

Re: adding data

Posted by Hamed Ghavamnia <gh...@gmail.com>.
First of all, thanks for the response. As for your questions:

> You already see a problem with sequential keys...
Part of my key is a timestamp, which is why my inserts have sequential
keys. I tried changing that by reversing the timestamp, e.g. if the
timestamp (in milliseconds) were 1234567891543 I would store
3451987654321 as my key, so the keys were no longer sequential. HBase
still stores them all in one region and on one node.
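
For reference, here is a rough sketch of the kind of salted key I've
been thinking about as an alternative to reversing the digits (the
16-bucket salt and the id field are just placeholders, not something
I've settled on):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {

  private static final int BUCKETS = 16;

  // Row key layout: [1-byte salt][8-byte timestamp][id bytes].
  // The salt spreads writes over BUCKETS key ranges instead of one hot range.
  public static byte[] rowKey(long timestampMillis, String id) {
    byte salt = (byte) ((id.hashCode() & 0x7fffffff) % BUCKETS);
    return Bytes.add(new byte[] { salt },
                     Bytes.toBytes(timestampMillis),
                     Bytes.toBytes(id));
  }
}

Of course, this only helps once the table actually has more than one
region for the buckets to spread over.
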
> What are your planned access patterns? Size of the row, growth rate? Decay rate?
> (do you even delete the data?)
The data I want to store is XML, at most 2 KB per record, but the
fields I extract add up to around 200 bytes. So each row has something
around 200 bytes of raw data which needs to be stored. Growth rate? I'm
not sure what you mean by growth rate. The maximum rate at which I store
data is 15,000 records per second, each at most 2 KB, which after
parsing comes down to 200 bytes of raw data. This data needs to be held
for 90 days, and then the old rows can be removed.
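
Since the data only needs to be kept for 90 days, I'm assuming I can let
HBase expire old cells with a column-family TTL instead of deleting them
myself. A rough sketch of creating the table that way (the table and
family names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithTtl {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("readings"); // placeholder table name
    HColumnDescriptor family = new HColumnDescriptor("d");    // placeholder family name
    family.setTimeToLive(90 * 24 * 60 * 60);                  // 90 days, in seconds
    desc.addFamily(family);

    admin.createTable(desc);
  }
}

Expired cells would then be dropped during major compactions rather than
by explicit deletes.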

> Does the Schema make sense, or do you want to look at Asynchronous HBase?
What's Asynchronous HBase?

> Then there are other considerations...
> Like your network and hardware...
> What are you running on?
> Memory, CPU, disk ... (SSDs?)
The nodes are running on servers with SAS HDDs. I have a virtualization
layer which allows me to run multiple nodes on each server. Each node
will have access to its own HDD.
I'm planning on assigning around 16 cores and 16 GB of RAM to the HBase
Master, which will also be running my HDFS NameNode and ZooKeeper.
Another 4 cores and 4 GB of RAM will be assigned to each Region Server,
which runs an HDFS DataNode as well. My nodes will be connected to each
other over a Gbit network.

Thanks again.

On Sat, Aug 4, 2012 at 5:10 PM, Michel Segel <mi...@hotmail.com> wrote:


Re: adding data

Posted by Michel Segel <mi...@hotmail.com>.
Ok, a couple of things....

First, a contrarian piece of advice....
Don't base your performance tuning on your initial load, but on your system at its steady state.
It's a simple concept that people forget and it can cause problems down the road....

So we have two problems...

Rita with 13 billion rows and Hamed with 15,000 row inserts per second.

Both are distinct problems...

Rita, what constraints do you have? Have you thought about your schema? Have you thought about your region size? Have you tuned up HBase? How long do you have to load the data?
What is the growth and use of the data?
(these are pretty much the same questions a DBA would face for a DW, ODS, OLTP, or NoSQL system.)

While you were already pointed to the bulk load, I thought you should also think about the other issues.

Hamed, 
15k rows a second?

You have a slightly different problem. Rita asks about initial load; you have an issue with sustained input rate.

You already see a problem with sequential keys... 
What are your planned access patterns? Size of the row, growth rate? Decay rate?
(do you even delete the data?)
Does the Schema make sense, or do you want to look at Asynchronous HBase?

Then there are other considerations...
Like your network and hardware...
What are you running on?
Memory, CPU, disk ... (SSDs?)

A lot of unknown factors... So to help we're going to need more information....

 
Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 4, 2012, at 2:44 AM, Hamed Ghavamnia <gh...@gmail.com> wrote:


Re: adding data

Posted by Hamed Ghavamnia <gh...@gmail.com>.
Hi,
I'm facing a somewhat similar problem. I need to insert 15,000 rows per
second into HBase. I'm getting really bad results using the simple Put
API (with multithreading). I've tried MapReduce integration as well. The
problem seems to be the shape of the row keys: my row keys are
incremental, which makes HBase store them all in the same region and
therefore on the same node. I've tried changing my keys to a more random
form, but HBase still stores them in the same region.
Any solutions would be appreciated. Some things which have crossed my mind:
1. Pre-split my regions (see the sketch below), but I'm not sure if the
problem has anything to do with the regions.
2. Use the bulk load mentioned in your emails, but I don't know where to
start. Do you have a link to some sample code which can be used?
Any ideas?
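
On point 1, here is a rough sketch of what pre-splitting at table
creation could look like, assuming the row key gets a one-byte salt
prefix (the table name, family name and 16 buckets are just
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("events");   // placeholder table name
    desc.addFamily(new HColumnDescriptor("d"));               // placeholder family name

    // One region per salt bucket: split points at salt bytes 1..15 give
    // 16 regions for keys whose first byte is 0x00 .. 0x0F.
    byte[][] splits = new byte[15][];
    for (int i = 0; i < 15; i++) {
      splits[i] = new byte[] { (byte) (i + 1) };
    }

    admin.createTable(desc, splits);
  }
}

Without a salt (or some other well-distributed key prefix), pre-splitting
alone won't help, since sequential keys still all fall into whichever
region covers the current end of the key range.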

On Sat, Aug 4, 2012 at 10:09 AM, anil gupta <an...@gmail.com> wrote:


Re: adding data

Posted by anil gupta <an...@gmail.com>.
Hi Rita,

The HBase bulk loader is a viable solution for loading such a huge data
set. Even if your import file has a separator other than tab, you can
use ImportTsv as long as the separator is a single character. If you
want to apply your own business logic while writing the data to HBase,
you can write your own mapper class and use it with the bulk loader, so
the bulk loader can be heavily customized to your needs.
These links might be helpful for you:
http://hbase.apache.org/book.html#arch.bulk.load
http://bigdatanoob.blogspot.com/2012/03/bulk-load-csv-file-into-hbase.html

HTH,
Anil Gupta

On Fri, Aug 3, 2012 at 9:54 PM, Bijeet Singh <bi...@gmail.com> wrote:




-- 
Thanks & Regards,
Anil Gupta

Re: adding data

Posted by Bijeet Singh <bi...@gmail.com>.
Well, if your file contains tab-separated values, you can directly use
the ImportTsv utility of HBase to do a bulk load.
More details about that can be found here:

http://hbase.apache.org/book/ops_mgt.html#importtsv

The other option for you is to run an MR job on the file that you have
to generate HFiles, which you can later load into HBase using
completebulkload. The HFiles are created using the HFileOutputFormat
class; the output of the map should be Put or KeyValue, and on the
reduce side you use configureIncrementalLoad, which sets up the reduce
tasks.
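
To make that concrete, here is a minimal sketch of such a job, assuming
a tab-separated key/value input file; the class name, paths, table name
("mytable") and column family ("cf") are placeholders, not anything from
this thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

  // Map each "key<TAB>value" line to a Put keyed on the row key.
  static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] kv = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(kv[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(kv[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-hfiles");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(LineToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path("/user/rita/input"));     // placeholder path
    FileOutputFormat.setOutputPath(job, new Path("/user/rita/hfiles"));  // placeholder path

    // configureIncrementalLoad wires up the reducer, partitioner and output
    // format so the generated HFiles line up with the table's current regions.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once the job finishes, the completebulkload tool (LoadIncrementalHFiles)
moves the generated HFiles into the table, as described in the book link
above.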

Bijeet


On Sat, Aug 4, 2012 at 8:13 AM, Rita <rm...@gmail.com> wrote:
