Posted to user@hbase.apache.org by Gaurav Vashishth <va...@gmail.com> on 2010/01/18 11:13:37 UTC

HBase Insert Performance

I need to store live data arriving at about 40-50K records/sec. I evaluated MySQL
and am now trying HBase.

I just read on Docstoc that HBase insert performance, for a few thousand rows with 10
columns of 1 MB values, is 68 ms/row. My scenario is similar: we need under
10k rows and 10-20 columns, each of which can have thousands of versions, with values
no greater than 300 bytes. Initially I thought HBase could serve the purpose,
but the Docstoc article has put doubt in my mind.

Can we get 40-50k records/sec insertion speed in HBase? Also, there will
be thousands of users reading the database; can HBase
sustain that kind of speed?

Thanks
Gaurav
-- 
View this message in context: http://old.nabble.com/HBase-Insert-Performance-tp27208387p27208387.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: HBase Insert Performance

Posted by Patrick Hunt <ph...@apache.org>.
In general when determining the number of ZooKeeper serving nodes to 
deploy (the size of an ensemble) you need to think in terms of 
reliability, and not performance.

Reliability:

A single ZooKeeper server (standalone) is essentially a coordinator with 
no reliability: a single serving-node failure brings down the ZK service.

A 3-server ensemble (you need to jump from 1 to 3, not 2, because ZK works 
on simple majority voting) allows a single server to fail while 
the service remains available.

So if you want reliability, go with at least 3. We typically recommend 
5 servers in "online" production serving environments. This 
allows you to take 1 server out of service (say, for planned maintenance) and 
still sustain an unexpected outage of one of the remaining 
servers without interrupting the service.
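The majority-voting point can be made concrete with a quick sketch (my illustration, not from the thread): an ensemble of n servers needs floor(n/2)+1 members up, so it tolerates (n-1)//2 failures — which is why 2 servers are no better than 1, and why even-sized ensembles add nothing.

```python
# Sketch: ZooKeeper stays available as long as a strict majority of the
# ensemble is up, so an ensemble of n servers tolerates (n - 1) // 2
# server failures.
def quorum(ensemble_size: int) -> int:
    """Minimum number of servers that must be up (simple majority)."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority survives."""
    return (ensemble_size - 1) // 2

for n in (1, 2, 3, 4, 5):
    print(f"{n} servers: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Running this shows 3 servers tolerate 1 failure and 5 tolerate 2, while 2 and 4 servers tolerate no more than 1 and 3 do — the arithmetic behind "jump to 3, then 5".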

Performance:

Write performance actually _decreases_ as you add ZK servers, while read 
performance increases modestly: http://bit.ly/9JEUju

See this page for a recent survey I did of operational latency 
with both a standalone server and an ensemble of size 3: 
http://bit.ly/4ekN8G You'll notice that even a single-core machine running a 
standalone ZK server (an ensemble of 1) can still process 15k requests 
per second. That is orders of magnitude more than what HBase 
currently uses ZK for (this may change in the future). (Background: 
http://bit.ly/csQLQ5)

Patrick


Re: HBase Insert Performance

Posted by Jean-Daniel Cryans <jd...@apache.org>.
If you have 1 cluster and it's very small then, as you point out, HBase isn't
intense on ZK (yet), so using only 1 ZK is OK.

Another setup, like we have here at StumbleUpon, is multiple clusters using
the same quorum. In that case it makes sense to get 3 or 5 nodes, and in our
case the hardware is beefy enough that they coexist with some slave
processes.

J-D

2010/2/12 Michał Podsiadłowski <po...@gmail.com>


Re: HBase Insert Performance

Posted by Michał Podsiadłowski <po...@gmail.com>.
Hey all,
I was asking about the minimum number of ZooKeepers, and the usual answer is an
odd number >= 3. Are there any reasons for this? Have you encountered
any problems with a single ZooKeeper?  As far as I know, HBase already does
very few operations through ZooKeeper, so the load on it is insignificant.
If I have only one master and one namenode, I already have 2 SPOFs, so another one
is not a big deal.  Currently we have 3 ZooKeepers running in Xen VMs, with
datanode/regionserver on the physical machines.
Can someone advise?

Thanks,
Michal

Re: HBase Insert Performance

Posted by Gaurav Vashishth <va...@gmail.com>.
Ryan, 

I have set up the cluster as suggested by you. I now have the master, namenode, and
ZooKeeper on one machine, with 8 region servers also running as datanodes,
and with this configuration I was able to get an insertion speed of around
18K records/sec. I'm still using 4 GB of RAM; I will upgrade that too, and I
expect that adding more region servers will increase the insertion speed.

Thanks,

Gaurav



-- 
View this message in context: http://old.nabble.com/HBase-Insert-Performance-tp27208387p27562803.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: HBase Insert Performance

Posted by Paul Ambrose <pa...@mac.com>.
When I run my test suite, I am seeing incorrect results from HBaseAdmin.tableExists() in both 
candidate 1 and candidate 2: it sometimes returns false when it should return true. 
If I revert to 0.20.2, the tests run cleanly.

Paul




Re: HBase Insert Performance

Posted by Gaurav Vashishth <va...@gmail.com>.
Thanks, will try this new version

-Gaurav



-- 
View this message in context: http://old.nabble.com/HBase-Insert-Performance-tp27208387p27215054.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: HBase Insert Performance

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I think this is https://issues.apache.org/jira/browse/HBASE-2035 fixed
in the upcoming 0.20.3. If you want to try it out, get the RC2 here
http://people.apache.org/~jdcryans/hbase-0.20.3-candidate-2/

J-D


Re: HBase Insert Performance

Posted by Gaurav Vashishth <va...@gmail.com>.
Thanks a lot, your words have encouraged me that it is doable. I will upgrade
the system and rerun the test case. 

Though, I have one more query:

When I insert records into HBase through a Put, I send the row id as a
long value like "80760057", but when I scan the table through the HBase shell
I always see the row key in this format:
\000\000\000\000\000\n\005+. Also, I can't get the value
through this row id, even though the column qualifier has values.
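For what it's worth, a likely explanation (my sketch, not a confirmed diagnosis from the thread): writing the row id with something like HBase's Bytes.toBytes(long) stores it as 8 raw big-endian bytes, which the shell escapes in octal, and a get using the ASCII string "80760057" then looks up a completely different key. In Python terms:

```python
import struct

# A long stored as 8 big-endian bytes (what a Java long serializes to)
# is mostly non-printable, so the HBase shell escapes it in octal
# rather than showing the decimal number.
row_id = 80760057
key_bytes = struct.pack(">q", row_id)   # 8-byte big-endian encoding
print(key_bytes)                        # b'\x00\x00\x00\x00\x04\xd0L\xf9'

# The decimal string is a *different* 8-byte key entirely:
print(key_bytes == b"80760057")         # False

# To read the row back, re-encode the long the same way; to recover
# the number from a scanned key, apply the inverse operation:
print(struct.unpack(">q", key_bytes)[0])  # 80760057
```

So a get with the string form of the id finds nothing; the client must use the same byte encoding for reads as it did for writes.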




-- 
View this message in context: http://old.nabble.com/HBase-Insert-Performance-tp27208387p27209231.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: HBase Insert Performance

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

So there are 2 major problems here:
- The setup is way off. There is no actual data replication, for
example: every write goes to 1 machine, and when it fails,
so goes your data.
- These machines don't have enough RAM. They need at least
1 GB/core, ideally 2 GB/core or more. This means they should have 8 GB of
RAM.  crucial.com

A better setup would be:
- 1 "master" node, running: hmaster, 1x zookeeper, namenode
- 5 data/regionservers

The key to performance here is to spread your workload over more
machines; this is how clustered software works in a nutshell. Using
only 1/3 of your machines as "regionservers" and 1/6th for data
storage (datanode) is non-ideal.

You really need to up the RAM.  I run:
- dual quad-core i7s with hyper-threading, which gives 16 cores to the OS
- 24 GB RAM
- 4 x 1 TB disk

My small-end machines are:
- dual quad-core Xeons, 8 cores to the OS
- 16 GB RAM
- 2 x 1 TB disk

For performance you really don't want less than 1-2 GB of RAM per
core. Without a lot of RAM, you don't get effective disk caching, you
can't run map-reduces on the same nodes, you may run into swap issues,
etc.  4 GB of DDR3 RAM is about $150 USD.

But given a reasonable machine set, doing 50k inserts/sec sustained
over long periods of time is totally doable. You will need more than 6
machines, though! Don't forget your spares, since you really want to be
able to operate on N-{1,2} machines so failures don't cripple you.
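The spread-the-load advice reduces to back-of-the-envelope arithmetic (my illustration; the per-node insert rate below is a made-up planning number, not a figure from the thread):

```python
import math

# Rough cluster sizing: divide the target write rate across
# regionservers, then add spares so the target still holds when
# running on N-1 or N-2 machines.
TARGET_WRITES_PER_SEC = 50_000
PER_NODE_WRITES_PER_SEC = 5_000   # hypothetical per-regionserver rate
SPARES = 2                        # tolerate N-{1,2} outages

needed = math.ceil(TARGET_WRITES_PER_SEC / PER_NODE_WRITES_PER_SEC)
total = needed + SPARES
print(f"{needed} regionservers for the target, {total} including spares")
```

Whatever the real per-node rate turns out to be, the shape of the calculation is the same: measure one regionserver, divide, and budget spares on top.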



On Mon, Jan 18, 2010 at 2:55 AM, Gaurav Vashishth <va...@gmail.com> wrote:
>
> Using 6 machines, 8 core with 4 GB Ram, right now for setting up the
> scenario.
>
> 2 region servers
> 1 ZooKeeper
> 1 Data Node
> 2 Name Node
>
>
>
> Ryan Rawson wrote:
>>
>> How many machines do you have? I'd try at least 20+ late model boxes.
>>
>> On Jan 18, 2010 2:14 AM, "Gaurav Vashishth" <va...@gmail.com> wrote:
>>
>>
>> I need to store live data which is about 40-50K records /sec, evaluated
>> MYSql
>> and now trying  HBase.
>>
>> Just read in docstoc that HBase insert performance, for few 1000 rows and
>> 10
>> columns with 1 MB values, is 68ms/row. My scenario is similar, we need
>> under
>> 10k rows, 10-20 columns and which can have thousands of version with
>> values
>> not greater than 300 bytes. Initially, I thought HBase can solve the
>> puprose
>> but reading docstoc article have put doubt in my mind.
>>
>> Can we get 40-50k records/sec insertion speed in HBase?? Also, there would
>> be thousand of users who will be reading teh database also, can HBase
>> maintain that much of speed?
>>
>> Thanks
>> Gaurav
>> --
>> View this message in context:
>> http://old.nabble.com/HBase-Insert-Performance-tp27208387p27208387.html
>> Sent from the HBase User mailing list archive at Nabble.com.
>>
>>
>
> --
> View this message in context: http://old.nabble.com/HBase-Insert-Performance-tp27208387p27208828.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>

Re: HBase Insert Performance

Posted by Gaurav Vashishth <va...@gmail.com>.
Using 6 machines, each with 8 cores and 4 GB of RAM, right now to set up the
scenario:

2 region servers
1 ZooKeeper
1 Data Node
2 Name Node




-- 
View this message in context: http://old.nabble.com/HBase-Insert-Performance-tp27208387p27208828.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: HBase Insert Performance

Posted by Ryan Rawson <ry...@gmail.com>.
How many machines do you have? I'd try at least 20+ late model boxes.

On Jan 18, 2010 2:14 AM, "Gaurav Vashishth" <va...@gmail.com> wrote:

