Posted to user@hbase.apache.org by Hari Shankar <sh...@gmail.com> on 2010/11/02 13:31:00 UTC

Re: HBase not scaling well

Hi all,

      I ran the TestDFSIO job on my cluster and thought I'd append it
here in case it is of any help:

10/11/02 17:53:56 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/11/02 17:53:56 INFO mapred.FileInputFormat:            Date & time:
Tue Nov 02 17:53:56 IST 2010
10/11/02 17:53:56 INFO mapred.FileInputFormat:        Number of files: 10
10/11/02 17:53:56 INFO mapred.FileInputFormat: Total MBytes processed: 10000
10/11/02 17:53:56 INFO mapred.FileInputFormat:      Throughput mb/sec:
1.2372449326777915
10/11/02 17:53:56 INFO mapred.FileInputFormat: Average IO rate mb/sec:
1.2381720542907715
10/11/02 17:53:56 INFO mapred.FileInputFormat:  IO rate std deviation:
0.03402313342081011
10/11/02 17:53:56 INFO mapred.FileInputFormat:     Test exec time sec: 866.931
10/11/02 17:53:56 INFO mapred.FileInputFormat:

10/11/02 17:59:35 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : read
10/11/02 17:59:35 INFO mapred.FileInputFormat:            Date & time:
Tue Nov 02 17:59:35 IST 2010
10/11/02 17:59:35 INFO mapred.FileInputFormat:        Number of files: 10
10/11/02 17:59:35 INFO mapred.FileInputFormat: Total MBytes processed: 10000
10/11/02 17:59:35 INFO mapred.FileInputFormat:      Throughput mb/sec:
22.776708537849196
10/11/02 17:59:35 INFO mapred.FileInputFormat: Average IO rate mb/sec:
28.383480072021484
10/11/02 17:59:35 INFO mapred.FileInputFormat:  IO rate std deviation:
12.521607590777203
10/11/02 17:59:35 INFO mapred.FileInputFormat:     Test exec time sec: 108.735
10/11/02 17:59:35 INFO mapred.FileInputFormat:

For a 3 node cluster, is this good/bad/ugly..? Where can I find data
to compare my cluster regarding such parameters?
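
One rough way to read these numbers while waiting for better reference points (this assumes the usual interpretation that the "Throughput mb/sec" figure is per map task, one task per file, so the cluster-wide rate is roughly nrFiles times it; treat it as a sanity check, not gospel):

    // Back-of-the-envelope reading of the TestDFSIO output above.
    // Assumption: "Throughput mb/sec" is per concurrent task (one task per file),
    // so the cluster-wide figure is roughly nrFiles * perTaskThroughput.
    public class DfsioSanityCheck {
        public static void main(String[] args) {
            int nrFiles = 10;                         // "Number of files: 10"
            double writePerTask = 1.2372449326777915; // write "Throughput mb/sec"
            double readPerTask  = 22.776708537849196; // read  "Throughput mb/sec"

            double aggWrite = nrFiles * writePerTask; // ~12.4 MB/s across the cluster
            double aggRead  = nrFiles * readPerTask;  // ~228 MB/s across the cluster

            // With replication 3, every MB written becomes ~3 MB of physical disk
            // writes spread over the 3 datanodes plus 2 network copies, which is
            // part of why the write number looks so much worse than the read one.
            System.out.printf("aggregate write ~= %.1f MB/s, read ~= %.1f MB/s%n",
                    aggWrite, aggRead);
        }
    }

If that reading is right, the cluster absorbs only on the order of 12 MB/s of user data on writes versus roughly 230 MB/s on reads, which at least matches the suspicion elsewhere in the thread that the write path (replication pipeline, one disk per node) is the bottleneck rather than HBase itself.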

Thanks,
Hari



On Sun, Oct 31, 2010 at 11:59 PM, Hari Shankar <sh...@gmail.com> wrote:
> Thanks guys for the replies, and very sorry for the late reply. We are
> quite new to linux environment... our production servers are currently
> running on windows and our linux sysadmin is yet to arrive. So please
> forgive my ignorance regarding linux tools. Very little prior
> experience in linux. All our 3 nodes are running on different linux
> distros - one on ubuntu server 10.10, one on CentOS and one on
> Ubuntu-desktop 10.04. All have the same directory structure and same
> versions of hadoop, hbase and java though. Let me know if you think
> this could be an issue. Basically we wanted to evaluate all three
> distros at the same time as well. I hope that plan didn't backfire.
>
> Back to the problem at hand, here are the iptraf, htop and iostat reports:
>
> iptraf snapshot --master
>
> Total rates:
> 165424.2 kbits/s
> 3800 packets/s
>
> Incoming:
> 109415.3 kbits/s
> 3007.4 packets/s
>
> iptraf snapshot --slave01
>
> Total rates:
> 102024 kbits/s
> 3128 packets/s
>
> Incoming:
> 48755.9 kbits/s
> 1784 packets/s
>
> iostat --master
>
> Linux 2.6.32-21-generic (hadoop1)       Sunday 31 October 2010  _x86_64_        (4 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    0.54    0.01      0.18     0.30         0.00   98.97
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda               2.43       123.11       412.93   33988462  114000368
>
> iostat --slave01
>
> Linux 2.6.35-22-server (hadoop2)        Sunday 31 October 2010  _x86_64_        (4 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                   0.77    0.00    0.29    0.18    0.00   98.77
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda               3.90       277.19      1228.97  245515598 1088525808
>
>
> iostat --slave02
>
> Linux 2.6.18-194.11.1.el5 (slave)       10/31/2010
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                  0.54    0.00             0.29    0.80    0.00   98.37
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda               6.57       302.09      1652.06  321914364 1760497088
> sda1              0.00         0.00         0.00       2458         88
> sda2              6.57       302.08      1652.06  321911602 1760497000
> dm-0            209.33       302.08      1652.06  321910322 1760496272
> dm-1              0.00         0.00         0.00        896        728
>
> htop --master
> http://imgur.com/3zTu7
>
> htop --slave01
> http://imgur.com/5HeyF
>
> htop --slave02
> http://imgur.com/lHin7
>
> I hope these are the reports that you were referring to. Please let me
> know otherwise. Also, is there an easier command-line way of fetching the
> iptraf and htop reports? The master is running Ubuntu desktop, slave01
> runs Ubuntu server and slave02 runs CentOS.
> Some more facts that I have noticed:
> - I ran the job just now on the cluster after reformatting the
> namenode and it took only 1 hr 15 mins instead of the usual 2 hrs,
> though still slower than the single node config (30-40 mins). Can it
> be that it is faster right after a namenode format?
> - The time set on one of the slaves was incorrect and it lagged by 4
> hrs compared to the other two machines. I corrected the time before
> formatting the namenode this time. I wonder if that could have an
> impact.
> - I have ZK running on all 3 machines. Shouldn't it work fine if I
> just set up ZK on one of the nodes? In that case, I get a weird error:
> could not connect to port 0::0::0::0::....:2181 or something of that
> sort. I'll post the full error next time I see it. (See the config
> sketch just after this list.)
> - The CentOS machine (slave02) seems to use a lot more CPU than the
> other 2 machines on average. CPU usage on CentOS hovers around 50-60%
> mostly, whereas it is more like 30-40% on the other 2 machines (ref.
> htop screenshots above).
> - On a single-node configuration, moving from a 4 GB-RAM dual-core
> laptop to an 8 GB quad-core machine gives a 1.8x performance
> improvement.
> - Increasing the child task heap size from the default 200 MB to 768
> MB improved performance on both single and multi node clusters by 100%
> (2x improvement). But going beyond 768 MB doesn't seem to have much
> impact.
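
A minimal sketch of the two knobs from the list above, pinning ZooKeeper to a single node and bumping the child heap, with placeholder hostnames and values (the same keys can equally live in hbase-site.xml and mapred-site.xml). An address that is all zeros on port 2181, like the error above, typically means the client fell back to a default or localhost quorum resolved over IPv6, so explicitly listing the real ZK host on every node is the first thing to try:

    // Hypothetical config sketch: "hadoop1" and the values are examples only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClusterConfSketch {
        public static Configuration make() {
            Configuration conf = HBaseConfiguration.create();
            // Run ZooKeeper on just one node and point every client at it.
            conf.set("hbase.zookeeper.quorum", "hadoop1");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            // The child-heap bump mentioned above (the default is -Xmx200m).
            conf.set("mapred.child.java.opts", "-Xmx768m");
            return conf;
        }
    }
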
>
> Michael and Jonathan, I think I have covered most of the info you guys
> had asked for above as well. It doesn't seem to be swapping, and yes,
> currently we are running all those processes on the master, and all
> processes minus the namenode, secondary namenode and JT on the slaves. But
> we run all those processes on a single machine in the single-node case
> as well, right? So if RAM/swap were the culprit, shouldn't it affect
> single-node config more?
>
> Do let me know if anything is missing or you think more info would
> help. Many thanks for your time and patience. :)
>
> Thanks,
> Hari
>
> On Fri, Oct 29, 2010 at 9:51 PM, Jonathan Gray <jg...@facebook.com> wrote:
>> Going from pseudo-distributed mode to a 3 node setup is definitely not "scaling" in a real way and I would expect performance degradation.  Most especially when you're also running at replication factor 3 and in a setup where the master node is also acting as a slave node and MR task node.
>>
>> You're adding an entirely new layer (HDFS) which will always cause increased latency/decreased throughput, and then you're running on 3 nodes with a replication factor of 3.  So now every write is going to all three nodes, via HDFS, rather than a single node straight to the FS.
>>
>> You said that "all parts should ideally be available on all nodes", but this is a write test?  So that's a bad thing not a good thing.
>>
>> I would expect about a 50% slowdown but you're seeing more like 75% slowdown.  Not so out of the ordinary still.  Stuffing a NN, DN, JT, TT, HMaster, and RS onto a single node is not a great idea.  And then you're running 4 simultaneous tasks on a 4 core machine (along with these 6 other processes in the case of the master node).
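
Two bits of back-of-the-envelope arithmetic behind these points, with D the data size, R the replication factor, N the number of datanodes, and assuming the stock ~1 GB daemon heaps plus the 768 MB child heap Hari mentions elsewhere in the thread (illustrative, not measured):

    \[
      W_{\text{disk/node}} \approx \tfrac{R}{N} D = D \quad (R = N = 3), \qquad
      \text{extra network copies per block} = R - 1 = 2
    \]
    \[
      \underbrace{6 \times 1\ \text{GB}}_{\text{NN+DN+JT+TT+HMaster+RS heaps}}
      + \underbrace{4 \times 0.75\ \text{GB}}_{\text{map task JVMs}}
      \approx 9\ \text{GB} > 8\ \text{GB RAM}
    \]

So each node still writes the whole dataset to its single disk, exactly as the pseudo-distributed setup did, while the configured heap maximums alone can already exceed physical memory, which is why the swapping question keeps coming up.
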
>>
>> How many disks do each of your nodes have?
>>
>> If you really want to "scale" HBase, you're going to need more nodes.  I've seen some success at a 5 node level but generally 10 nodes and up is when HBase does well (and replication 3 makes sense).
>>
>> JG
>>
>>
>>> -----Original Message-----
>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>> Sent: Friday, October 29, 2010 8:03 AM
>>> To: user@hbase.apache.org
>>> Subject: RE: HBase not scaling well
>>>
>>>
>>>
>>> I'd actually take a step back and ask what Hari is trying to do?
>>>
>>> It's difficult to figure out what the problem is when the OP says I've
>>> got code that works in individual pseudo mode, but not in an actual
>>> cluster.
>>> It would be nice to know version(s), configuration... 3 nodes... are
>>> they running ZK on the same machines that they are running Region
>>> Servers... Are they swapping? 8GB of memory can disappear quickly...
>>>
>>> Lots of questions...
>>>
>>>
>>> > From: clehene@adobe.com
>>> > To: user@hbase.apache.org
>>> > Date: Fri, 29 Oct 2010 09:05:28 +0100
>>> > Subject: Re: HBase not scaling well
>>> >
>>> > Hi Hari,
>>> >
>>> > Could you do some realtime monitoring (htop, iptraf, iostat) and
>>> report the results? Also you could add some timers to the map-reduce
>>> operations: measure average operations times to figure out what's
>>> taking so long.
>>> >
>>> > Cosmin
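
A rough sketch of that timing idea, combined with the kind of map-only upload described below: wrap each write in a timer and publish the totals as job counters, so the average put latency (PUT_NANOS / PUTS) shows up on the JobTracker counters page. The class name, column family and parsing are placeholders, not the actual job:

    // Hypothetical mapper: times each HBase write and reports totals as counters.
    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TimedUploadMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Placeholder parsing: first tab-separated field is the row key.
            String[] fields = line.toString().split("\t", 2);
            if (fields.length < 2) {
                context.getCounter("upload", "BAD_LINES").increment(1);
                return;
            }
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("raw"), Bytes.toBytes(fields[1]));

            long t0 = System.nanoTime();
            context.write(new ImmutableBytesWritable(Bytes.toBytes(fields[0])), put);
            context.getCounter("upload", "PUT_NANOS").increment(System.nanoTime() - t0);
            context.getCounter("upload", "PUTS").increment(1);
        }
    }
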
>>> > On Oct 29, 2010, at 9:55 AM, Hari Shankar wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > >     We are currently doing a POC for HBase in our system. We have
>>> > > written a bulk upload job to upload our data from a text file into
>>> > > HBase. We are using a 3-node cluster, one master which also works
>>> as
>>> > > slave (running as namenode, jobtracker, HMaster, datanode,
>>> > > tasktracker, HQuorumpeer and  HRegionServer) and 2 slaves
>>> (datanode,
>>> > > tasktracker, HQuorumpeer and  HRegionServer running). The problem
>>> is
>>> > > that we are getting lower performance from the distributed cluster than
>>> > > what we were getting from the single-node pseudo-distributed setup. The
>>> > > upload is taking about 30  minutes on an individual machine,
>>> whereas
>>> > > it is taking 2 hrs on the cluster. We have replication set to 3, so
>>> > > all parts should ideally be available on all nodes, so we doubt if
>>> the
>>> > > problem is network latency. scp of files between nodes gives a
>>> speed
>>> > > of about 12 MB/s, which I believe should be good enough for this to
>>> > > function. Please correct me if I am wrong here. The nodes are all 4
>>> > > core machines with 8 GB RAM.  We are spawning 4 simultaneous map
>>> tasks
>>> > > on each node, and the job does not have any reduce phase. Any help
>>> is
>>> > > greatly appreciated.
>>> > >
>>> > > Thanks,
>>> > > Hari Shankar
>>> >
>>>
>>
>

Re: HBase not scaling well

Posted by Stack <st...@duboce.net>.
Thanks for updating the list with your findings, Hari.  A few of us
were baffled there for a while.
St.Ack

On Tue, Nov 23, 2010 at 9:23 AM, Hari Shankar <sh...@gmail.com> wrote:
> Hi All,
>
>         I think there was some problem with our earlier setup. When I
> tested the 3-node setup, the three machines had different OSes (Ubuntu
> server, Ubuntu desktop and CentOS 5.4). Now, while checking the 4-node
> and 6-node setups, I had all machines on exactly the same OS, Java version,
> etc. I am not sure if that is what made the difference, but I retried
> the 3-node setup with this configuration and, surprisingly, I got similar
> results for the 3-node setup as well. Now I get the job done in 25-30 mins
> with 3 nodes. (Single node was 35 mins on average.) So after all, the
> result is quite reasonable now. In any case, the drastic degradation
> in performance going from 1 to 3 nodes which I got earlier was quite
> mysterious. I guess having the same OS helps. Has anyone else faced
> similar issues/found one OS performing better than another w.r.t
> HBase? I'll update my earlier data lest anyone gets misled:
>
> Single node --> 35-45 mins (Tested on all 3 OSes individually)
> 3-node, Replication 1 --> 60-80 mins (Multi-OS cluster - Ubuntu server
> 10.10+Ubuntu Desktop 10.04+CentOS 5.4)
> 3-node, Replication 2 --> 25-30 mins (Same OS, CentOS 5.5 on all nodes)
> 3-node, Replication 3 --> 50-60 mins (Multi OS cluster)
> 4-node, Replication 1 --> 23-28 mins (Same OS)
> 4-node, Replication 4 --> ~20 mins (Same OS)
> 6-node, Replication 2 --> ~15 mins (Same OS)
>
> Regards,
> Hari
>
> On Tue, Nov 23, 2010 at 11:24 AM, Hari Shankar <sh...@gmail.com> wrote:
>> Hi Jonathan,
>>
>>         You were right, adding one more node really helped. With 4
>> nodes, the time taken by my job has finally dropped below the time
>> taken by a single node. My job bulk-uploads data into HBase from a
>> text file using the standard API (not the bulk upload tool). I had 2.13 GB of
>> data. My cluster details can be found in the posts above. All my nodes
>> have the same config (8 GB RAM; 4-core 3.2 GHz; 500 GB single disk; 1 Gbps
>> network). The only change is that I am now using CentOS 5.5 on all
>> nodes and java 1.6.0_22. I'll post the approximate time taken for
>> everyone's reference:
>>
>> Single node --> 35-45 mins
>> 3-node, Replication 1 --> 60-80 mins
>> 3-node, Replication 3 --> 50-60 mins
>> 4-node, Replication 1 --> 23-28 mins!
>> 4-node, Replication 4 --> ~20 mins
>> 6-node, Replication 2 --> ~15 mins
>>
>> I used 4 mappers/tasktracker and 1 reducer. I think that can be
>> tweaked for further improvement.
>>
>> I hope those stats will be useful for other newbies as well. Like
>> Jonathan said above, it seems HBase needs more nodes to really start
>> scaling. Fewer than 4 nodes actually degrades performance.
>> Thanks for all the help, guys!
>>
>> Thanks and Regards,
>> Hari
>>
>> On Tue, Nov 2, 2010 at 6:01 PM, Hari Shankar <sh...@gmail.com> wrote:
>>> Hi all,
>>>
>>>      I ran the TestDFSIO job on my cluster and thought I'd append it
>>> here in case it is of any help:
>>>
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:            Date & time:
>>> Tue Nov 02 17:53:56 IST 2010
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:        Number of files: 10
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat: Total MBytes processed: 10000
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:      Throughput mb/sec:
>>> 1.2372449326777915
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat: Average IO rate mb/sec:
>>> 1.2381720542907715
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:  IO rate std deviation:
>>> 0.03402313342081011
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:     Test exec time sec: 866.931
>>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:
>>>
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : read
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:            Date & time:
>>> Tue Nov 02 17:59:35 IST 2010
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:        Number of files: 10
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat: Total MBytes processed: 10000
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:      Throughput mb/sec:
>>> 22.776708537849196
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat: Average IO rate mb/sec:
>>> 28.383480072021484
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:  IO rate std deviation:
>>> 12.521607590777203
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:     Test exec time sec: 108.735
>>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:
>>>
>>> For a 3 node cluster, is this good/bad/ugly..? Where can I find data
>>> to compare my cluster regarding such parameters?
>>>
>>> Thanks,
>>> Hari
>>>
>>>
>>>
>>> On Sun, Oct 31, 2010 at 11:59 PM, Hari Shankar <sh...@gmail.com> wrote:
>>>> Thanks guys for the replies, and very sorry for the late reply. We are
>>>> quite new to linux environment... our production servers are currently
>>>> running on windows and our linux sysadmin is yet to arrive. So please
>>>> forgive my ignorance regarding linux tools. Very little prior
>>>> experience in linux. All our 3 nodes are running on different linux
>>>> distros - one on ubuntu server 10.10, one on CentOS and one on
>>>> Ubuntu-desktop 10.04. All have the same directory structure and same
>>>> versions of hadoop, hbase and java though. Let me know if you think
>>>> this could be an issue. Basically we wanted to evaluate all three
>>>> distros at the same time as well. I hope that plan didn't backfire.
>>>>
>>>> Back to the problem at hand, here are the iptraf, htop and iostat reports:
>>>>
>>>> iptraf snapshot --master
>>>>
>>>> Total rates:
>>>> 165424.2 kbits/s
>>>> 3800 packets/s
>>>>
>>>> Incoming:
>>>> 109415.3 kbits/s
>>>> 3007.4 packets/s
>>>>
>>>> iptraf snapshot --slave01
>>>>
>>>> Total rates:
>>>> 102024 kbits/s
>>>> 3128 packets/s
>>>>
>>>> Incoming:
>>>> 48755.9 kbits/s
>>>> 1784 packets/s
>>>>
>>>> iostat --master
>>>>
>>>> Linux 2.6.32-21-generic (hadoop1)       Sunday 31 October 2010  _x86_64_        (4 CPU)
>>>>
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>                    0.54    0.01      0.18     0.30         0.00   98.97
>>>>
>>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>>> sda               2.43       123.11       412.93   33988462  114000368
>>>>
>>>> iostat --slave01
>>>>
>>>> Linux 2.6.35-22-server (hadoop2)        Sunday 31 October 2010  _x86_64_        (4 CPU)
>>>>
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>                   0.77    0.00    0.29    0.18    0.00   98.77
>>>>
>>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>>> sda               3.90       277.19      1228.97  245515598 1088525808
>>>>
>>>>
>>>> iostat --slave02
>>>>
>>>> Linux 2.6.18-194.11.1.el5 (slave)       10/31/2010
>>>>
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>                  0.54    0.00             0.29    0.80    0.00   98.37
>>>>
>>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>>> sda               6.57       302.09      1652.06  321914364 1760497088
>>>> sda1              0.00         0.00         0.00       2458         88
>>>> sda2              6.57       302.08      1652.06  321911602 1760497000
>>>> dm-0            209.33       302.08      1652.06  321910322 1760496272
>>>> dm-1              0.00         0.00         0.00        896        728
>>>>
>>>> htop --master
>>>> http://imgur.com/3zTu7
>>>>
>>>> htop --slave01
>>>> http://imgur.com/5HeyF
>>>>
>>>> htop --slave02
>>>> http://imgur.com/lHin7
>>>>
>>>> I hope these are the reports that you were referring to. Please let me
>>>> know otherwise. Also, is there an easier command-line way of fetching the
>>>> iptraf and htop reports? The master is running Ubuntu desktop, slave01
>>>> runs Ubuntu server and slave02 runs CentOS.
>>>> Some more facts that I have noticed:
>>>> - I ran the job just now on the cluster after reformatting the
>>>> namenode and it took only 1 hr 15 mins instead of the usual 2 hrs,
>>>> though still slower than the single node config (30-40 mins). Can it
>>>> be that it is faster right after a namenode format?
>>>> - The time set on one of the slaves was incorrect and it lagged by 4
>>>> hrs compared to the other two machines. I corrected the time before
>>>> formatting the namenode this time. I wonder if that could have an
>>>> impact.
>>>> - I have ZK running on all 3 machines. Shouldn't it work fine if I
>>>> just set up ZK on one of the nodes? In that case, I get a weird error:
>>>> could not connect to port 0::0::0::0::....:2181 or something of that
>>>> sort. I'll post the full error next time I see it.
>>>> - The CentOS machine (slave02) seems to use a lot more CPU than the
>>>> other 2 machines on average. CPU usage on CentOS hovers around 50-60%
>>>> mostly, whereas it is more like 30-40% on the other 2 machines (ref.
>>>> htop screenshots above).
>>>> - On a single-node configuration, moving from a 4 GB-RAM dual-core
>>>> laptop to an 8 GB quad-core machine gives a 1.8x performance
>>>> improvement.
>>>> - Increasing the child task heap size from the default 200 MB to 768
>>>> MB improved performance on both single and multi node clusters by 100%
>>>> (2x improvement). But going beyond 768 MB doesn't seem to have much
>>>> impact.
>>>>
>>>> Michael and Jonathan, I think I have covered most of the info you guys
>>>> had asked for above as well. It doesn't seem to be swapping, and yes,
>>>> currently we are running all those processes on the master, and all
>>>> processes minus the namenode, secondary namenode and JT on the slaves. But
>>>> we run all those processes on a single machine in the single-node case
>>>> as well, right? So if RAM/swap were the culprit, shouldn't it affect
>>>> single-node config more?
>>>>
>>>> Do let me know if anything is missing or you think more info would
>>>> help. Many thanks for your time and patience. :)
>>>>
>>>> Thanks,
>>>> Hari
>>>>
>>>> On Fri, Oct 29, 2010 at 9:51 PM, Jonathan Gray <jg...@facebook.com> wrote:
>>>>> Going from pseudo-distributed mode to a 3 node setup is definitely not "scaling" in a real way and I would expect performance degradation.  Most especially when you're also running at replication factor 3 and in a setup where the master node is also acting as a slave node and MR task node.
>>>>>
>>>>> You're adding an entirely new layer (HDFS) which will always cause increased latency/decreased throughput, and then you're running on 3 nodes with a replication factor of 3.  So now every write is going to all three nodes, via HDFS, rather than a single node straight to the FS.
>>>>>
>>>>> You said that "all parts should ideally be available on all nodes", but this is a write test?  So that's a bad thing not a good thing.
>>>>>
>>>>> I would expect about a 50% slowdown but you're seeing more like 75% slowdown.  Not so out of the ordinary still.  Stuffing a NN, DN, JT, TT, HMaster, and RS onto a single node is not a great idea.  And then you're running 4 simultaneous tasks on a 4 core machine (along with these 6 other processes in the case of the master node).
>>>>>
>>>>> How many disks do each of your nodes have?
>>>>>
>>>>> If you really want to "scale" HBase, you're going to need more nodes.  I've seen some success at a 5 node level but generally 10 nodes and up is when HBase does well (and replication 3 makes sense).
>>>>>
>>>>> JG
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>>>> Sent: Friday, October 29, 2010 8:03 AM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: RE: HBase not scaling well
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'd actually take a step back and ask what Hari is trying to do?
>>>>>>
>>>>>> It's difficult to figure out what the problem is when the OP says I've
>>>>>> got code that works in individual pseudo mode, but not in an actual
>>>>>> cluster.
>>>>>> It would be nice to know version(s), configuration... 3 nodes... are
>>>>>> they running ZK on the same machines that they are running Region
>>>>>> Servers... Are they swapping? 8GB of memory can disappear quickly...
>>>>>>
>>>>>> Lots of questions...
>>>>>>
>>>>>>
>>>>>> > From: clehene@adobe.com
>>>>>> > To: user@hbase.apache.org
>>>>>> > Date: Fri, 29 Oct 2010 09:05:28 +0100
>>>>>> > Subject: Re: HBase not scaling well
>>>>>> >
>>>>>> > Hi Hari,
>>>>>> >
>>>>>> > Could you do some realtime monitoring (htop, iptraf, iostat) and
>>>>>> report the results? Also you could add some timers to the map-reduce
>>>>>> operations: measure average operations times to figure out what's
>>>>>> taking so long.
>>>>>> >
>>>>>> > Cosmin
>>>>>> > On Oct 29, 2010, at 9:55 AM, Hari Shankar wrote:
>>>>>> >
>>>>>> > > Hi,
>>>>>> > >
>>>>>> > >     We are currently doing a POC for HBase in our system. We have
>>>>>> > > written a bulk upload job to upload our data from a text file into
>>>>>> > > HBase. We are using a 3-node cluster, one master which also works
>>>>>> as
>>>>>> > > slave (running as namenode, jobtracker, HMaster, datanode,
>>>>>> > > tasktracker, HQuorumpeer and  HRegionServer) and 2 slaves
>>>>>> (datanode,
>>>>>> > > tasktracker, HQuorumpeer and  HRegionServer running). The problem
>>>>>> is
>>>>>> > > that we are getting lower performance from the distributed cluster than
>>>>>> > > what we were getting from the single-node pseudo-distributed setup. The
>>>>>> > > upload is taking about 30  minutes on an individual machine,
>>>>>> whereas
>>>>>> > > it is taking 2 hrs on the cluster. We have replication set to 3, so
>>>>>> > > all parts should ideally be available on all nodes, so we doubt if
>>>>>> the
>>>>>> > > problem is network latency. scp of files between nodes gives a
>>>>>> speed
>>>>>> > > of about 12 MB/s, which I believe should be good enough for this to
>>>>>> > > function. Please correct me if I am wrong here. The nodes are all 4
>>>>>> > > core machines with 8 GB RAM.  We are spawning 4 simultaneous map
>>>>>> tasks
>>>>>> > > on each node, and the job does not have any reduce phase. Any help
>>>>>> is
>>>>>> > > greatly appreciated.
>>>>>> > >
>>>>>> > > Thanks,
>>>>>> > > Hari Shankar
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: HBase not scaling well

Posted by Hari Shankar <sh...@gmail.com>.
Hi All,

         I think there was some problem with our earlier setup. When I
tested the 3-node setup, the three machines had different OSes (Ubuntu
server, Ubuntu desktop and CentOS 5.4). Now, while checking the 4-node
and 6-node setups, I had all machines on exactly the same OS, Java version,
etc. I am not sure if that is what made the difference, but I retried
the 3-node setup with this configuration and, surprisingly, I got similar
results for the 3-node setup as well. Now I get the job done in 25-30 mins
with 3 nodes. (Single node was 35 mins on average.) So after all, the
result is quite reasonable now. In any case, the drastic degradation
in performance going from 1 to 3 nodes which I got earlier was quite
mysterious. I guess having the same OS helps. Has anyone else faced
similar issues/found one OS performing better than another w.r.t
HBase? I'll update my earlier data lest anyone gets misled:

Single node --> 35-45 mins (Tested on all 3 OSes individually)
3-node, Replication 1 --> 60-80 mins (Multi-OS cluster - Ubuntu server
10.10+Ubuntu Desktop 10.04+CentOS 5.4)
3-node, Replication 2 --> 25-30 mins (Same OS, CentOS 5.5 on all nodes)
3-node, Replication 3 --> 50-60 mins (Multi OS cluster)
4-node, Replication 1 --> 23-28 mins (Same OS)
4-node, Replication 4 --> ~20 mins (Same OS)
6-node, Replication 2 --> ~15 mins (Same OS)
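
For anyone reproducing these runs: the replication factor here is the HDFS dfs.replication setting, and it is easy to double-check what a given file actually ended up with (the path below is only a placeholder):

    // Hypothetical check: print the replication factor HDFS recorded for a file.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path(args.length > 0 ? args[0] : "/user/hadoop/somefile");
            FileStatus st = fs.getFileStatus(p);
            System.out.println(p + " replication=" + st.getReplication()
                    + " blocksize=" + st.getBlockSize());
        }
    }

Running hadoop fsck on a path with -files -blocks shows the same information from the command line, block by block.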

Regards,
Hari

On Tue, Nov 23, 2010 at 11:24 AM, Hari Shankar <sh...@gmail.com> wrote:
> Hi Jonathan,
>
>         You were right, adding one more node really helped. With 4
> nodes, the time taken by my job has finally dropped below the time
> taken by a single node. My job bulk-uploads data into HBase from a
> text file using the standard API (not the bulk upload tool). I had 2.13 GB of
> data. My cluster details can be found in the posts above. All my nodes
> have the same config (8 GB RAM; 4-core 3.2 GHz; 500 GB single disk; 1 Gbps
> network). The only change is that I am now using CentOS 5.5 on all
> nodes and java 1.6.0_22. I'll post the approximate time taken for
> everyone's reference:
>
> Single node --> 35-45 mins
> 3-node, Replication 1 --> 60-80 mins
> 3-node, Replication 3 --> 50-60 mins
> 4-node, Replication 1 --> 23-28 mins!
> 4-node, Replication 4 --> ~20 mins
> 6-node, Replication 2 --> ~15 mins
>
> I used 4 mappers/tasktracker and 1 reducer. I think that can be
> tweaked for further improvement.
>
> I hope those stats will be useful for other newbies as well. Like
> Jonathan said above, it seems HBase needs more nodes to really start
> scaling. Fewer than 4 nodes actually degrades performance.
> Thanks for all the help, guys!
>
> Thanks and Regards,
> Hari
>
> On Tue, Nov 2, 2010 at 6:01 PM, Hari Shankar <sh...@gmail.com> wrote:
>> Hi all,
>>
>>      I ran the TestDFSIO job on my cluster and thought I'd append it
>> here in case it is of any help:
>>
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:            Date & time:
>> Tue Nov 02 17:53:56 IST 2010
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:        Number of files: 10
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat: Total MBytes processed: 10000
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:      Throughput mb/sec:
>> 1.2372449326777915
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat: Average IO rate mb/sec:
>> 1.2381720542907715
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:  IO rate std deviation:
>> 0.03402313342081011
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:     Test exec time sec: 866.931
>> 10/11/02 17:53:56 INFO mapred.FileInputFormat:
>>
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : read
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:            Date & time:
>> Tue Nov 02 17:59:35 IST 2010
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:        Number of files: 10
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat: Total MBytes processed: 10000
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:      Throughput mb/sec:
>> 22.776708537849196
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat: Average IO rate mb/sec:
>> 28.383480072021484
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:  IO rate std deviation:
>> 12.521607590777203
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:     Test exec time sec: 108.735
>> 10/11/02 17:59:35 INFO mapred.FileInputFormat:
>>
>> For a 3 node cluster, is this good/bad/ugly..? Where can I find data
>> to compare my cluster regarding such parameters?
>>
>> Thanks,
>> Hari
>>
>>
>>
>> On Sun, Oct 31, 2010 at 11:59 PM, Hari Shankar <sh...@gmail.com> wrote:
>>> Thanks guys for the replies, and very sorry for the late reply. We are
>>> quite new to linux environment... our production servers are currently
>>> running on windows and our linux sysadmin is yet to arrive. So please
>>> forgive my ignorance regarding linux tools. Very little prior
>>> experience in linux. All our 3 nodes are running on different linux
>>> distros - one on ubuntu server 10.10, one on CentOS and one on
>>> Ubuntu-desktop 10.04. All have the same directory structure and same
>>> versions of hadoop, hbase and java though. Let me know if you think
>>> this could be an issue. Basically we wanted to evaluate all three
>>> distros at the same time as well. I hope that plan didn't backfire.
>>>
>>> Back to the problem at hand, here are the iptraf, htop and iostat reports:
>>>
>>> iptraf snapshot --master
>>>
>>> Total rates:
>>> 165424.2 kbits/s
>>> 3800 packets/s
>>>
>>> Incoming:
>>> 109415.3 kbits/s
>>> 3007.4 packets/s
>>>
>>> iptraf snapshot --slave01
>>>
>>> Total rates:
>>> 102024 kbits/s
>>> 3128 packets/s
>>>
>>> Incoming:
>>> 48755.9 kbits/s
>>> 1784 packets/s
>>>
>>> iostat --master
>>>
>>> Linux 2.6.32-21-generic (hadoop1)       Sunday 31 October 2010  _x86_64_        (4 CPU)
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>                    0.54    0.01      0.18     0.30         0.00   98.97
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda               2.43       123.11       412.93   33988462  114000368
>>>
>>> iostat --slave01
>>>
>>> Linux 2.6.35-22-server (hadoop2)        Sunday 31 October 2010  _x86_64_        (4 CPU)
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>                   0.77    0.00    0.29    0.18    0.00   98.77
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda               3.90       277.19      1228.97  245515598 1088525808
>>>
>>>
>>> iostat --slave02
>>>
>>> Linux 2.6.18-194.11.1.el5 (slave)       10/31/2010
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>                  0.54    0.00             0.29    0.80    0.00   98.37
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda               6.57       302.09      1652.06  321914364 1760497088
>>> sda1              0.00         0.00         0.00       2458         88
>>> sda2              6.57       302.08      1652.06  321911602 1760497000
>>> dm-0            209.33       302.08      1652.06  321910322 1760496272
>>> dm-1              0.00         0.00         0.00        896        728
>>>
>>> htop --master
>>> http://imgur.com/3zTu7
>>>
>>> htop --slave01
>>> http://imgur.com/5HeyF
>>>
>>> htop --slave02
>>> http://imgur.com/lHin7
>>>
>>> I hope these are the reports that you were referring to. Please let me
>>> know otherwise. Also, is there an easier command-line way of fetching the
>>> iptraf and htop reports? The master is running Ubuntu desktop, slave01
>>> runs Ubuntu server and slave02 runs CentOS.
>>> Some more facts that I have noticed:
>>> - I ran the job just now on the cluster after reformatting the
>>> namenode and it took only 1 hr 15 mins instead of the usual 2 hrs,
>>> though still slower than the single node config (30-40 mins). Can it
>>> be that it is faster right after a namenode format?
>>> - The time set on one of the slaves was incorrect and it lagged by 4
>>> hrs compared to the other two machines. I corrected the time before
>>> formatting the namenode this time. I wonder if that could have an
>>> impact.
>>> - I have ZK running on all 3 machines. Shouldn't it work fine if I
>>> just set up ZK on one of the nodes? In that case, I get a weird error:
>>> could not connect to port 0::0::0::0::....:2181 or something of that
>>> sort. I'll post the full error next time I see it.
>>> - The CentOS machine (slave02) seems to use a lot more CPU than the
>>> other 2 machines on average. CPU usage on CentOS hovers around 50-60%
>>> mostly, whereas it is more like 30-40% on the other 2 machines (ref.
>>> htop screenshots above).
>>> - On a single-node configuration, moving from a 4 GB-RAM dual-core
>>> laptop to an 8 GB quad-core machine gives a 1.8x performance
>>> improvement.
>>> - Increasing the child task heap size from the default 200 MB to 768
>>> MB improved performance on both single and multi node clusters by 100%
>>> (2x improvement). But going beyond 768 MB doesn't seem to have much
>>> impact.
>>>
>>> Michael and Jonathan, I think I have covered most of the info you guys
>>> had asked for above as well. It doesn't seem to be swapping, and yes,
>>> currently we are running all those processes on the master, and all
>>> processes minus the namenode, secondary namenode and JT on the slaves. But
>>> we run all those processes on a single machine in the single-node case
>>> as well, right? So if RAM/swap were the culprit, shouldn't it affect
>>> single-node config more?
>>>
>>> Do let me know if anything is missing or you think more info would
>>> help. Many thanks for your time and patience. :)
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Fri, Oct 29, 2010 at 9:51 PM, Jonathan Gray <jg...@facebook.com> wrote:
>>>> Going from pseudo-distributed mode to a 3 node setup is definitely not "scaling" in a real way and I would expect performance degradation.  Most especially when you're also running at replication factor 3 and in a setup where the master node is also acting as a slave node and MR task node.
>>>>
>>>> You're adding an entirely new layer (HDFS) which will always cause increased latency/decreased throughput, and then you're running on 3 nodes with a replication factor of 3.  So now every write is going to all three nodes, via HDFS, rather than a single node straight to the FS.
>>>>
>>>> You said that "all parts should ideally be available on all nodes", but this is a write test?  So that's a bad thing not a good thing.
>>>>
>>>> I would expect about a 50% slowdown but you're seeing more like 75% slowdown.  Not so out of the ordinary still.  Stuffing a NN, DN, JT, TT, HMaster, and RS onto a single node is not a great idea.  And then you're running 4 simultaneous tasks on a 4 core machine (along with these 6 other processes in the case of the master node).
>>>>
>>>> How many disks do each of your nodes have?
>>>>
>>>> If you really want to "scale" HBase, you're going to need more nodes.  I've seen some success at a 5 node level but generally 10 nodes and up is when HBase does well (and replication 3 makes sense).
>>>>
>>>> JG
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>>> Sent: Friday, October 29, 2010 8:03 AM
>>>>> To: user@hbase.apache.org
>>>>> Subject: RE: HBase not scaling well
>>>>>
>>>>>
>>>>>
>>>>> I'd actually take a step back and ask what Hari is trying to do?
>>>>>
>>>>> It's difficult to figure out what the problem is when the OP says I've
>>>>> got code that works in individual pseudo mode, but not in an actual
>>>>> cluster.
>>>>> It would be nice to know version(s), configuration... 3 nodes... are
>>>>> they running ZK on the same machines that they are running Region
>>>>> Servers... Are they swapping? 8GB of memory can disappear quickly...
>>>>>
>>>>> Lots of questions...
>>>>>
>>>>>
>>>>> > From: clehene@adobe.com
>>>>> > To: user@hbase.apache.org
>>>>> > Date: Fri, 29 Oct 2010 09:05:28 +0100
>>>>> > Subject: Re: HBase not scaling well
>>>>> >
>>>>> > Hi Hari,
>>>>> >
>>>>> > Could you do some realtime monitoring (htop, iptraf, iostat) and
>>>>> report the results? Also you could add some timers to the map-reduce
>>>>> operations: measure average operations times to figure out what's
>>>>> taking so long.
>>>>> >
>>>>> > Cosmin
>>>>> > On Oct 29, 2010, at 9:55 AM, Hari Shankar wrote:
>>>>> >
>>>>> > > Hi,
>>>>> > >
>>>>> > >     We are currently doing a POC for HBase in our system. We have
>>>>> > > written a bulk upload job to upload our data from a text file into
>>>>> > > HBase. We are using a 3-node cluster, one master which also works
>>>>> as
>>>>> > > slave (running as namenode, jobtracker, HMaster, datanode,
>>>>> > > tasktracker, HQuorumpeer and  HRegionServer) and 2 slaves
>>>>> (datanode,
>>>>> > > tasktracker, HQuorumpeer and  HRegionServer running). The problem
>>>>> is
>>>>> > > that we are getting lower performance from the distributed cluster than
>>>>> > > what we were getting from the single-node pseudo-distributed setup. The
>>>>> > > upload is taking about 30  minutes on an individual machine,
>>>>> whereas
>>>>> > > it is taking 2 hrs on the cluster. We have replication set to 3, so
>>>>> > > all parts should ideally be available on all nodes, so we doubt if
>>>>> the
>>>>> > > problem is network latency. scp of files between nodes gives a
>>>>> speed
>>>>> > > of about 12 MB/s, which I believe should be good enough for this to
>>>>> > > function. Please correct me if I am wrong here. The nodes are all 4
>>>>> > > core machines with 8 GB RAM.  We are spawning 4 simultaneous map
>>>>> tasks
>>>>> > > on each node, and the job does not have any reduce phase. Any help
>>>>> is
>>>>> > > greatly appreciated.
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Hari Shankar
>>>>> >
>>>>>
>>>>
>>>
>>
>

Re: HBase not scaling well

Posted by Hari Shankar <sh...@gmail.com>.
Hi Jonathan,

         You were right, adding one more node really helped. With 4
nodes, the time taken by my job has finally dropped below the time
taken by a single node. My job bulk-uploads data into HBase from a
text file using the standard API (not the bulk upload tool). I had 2.13 GB of
data. My cluster details can be found in the posts above. All my nodes
have the same config (8 GB RAM; 4-core 3.2 GHz; 500 GB single disk; 1 Gbps
network). The only change is that I am now using CentOS 5.5 on all
nodes and java 1.6.0_22. I'll post the approximate time taken for
everyone's reference:

Single node --> 35-45 mins
3-node, Replication 1 --> 60-80 mins
3-node, Replication 3 --> 50-60 mins
4-node, Replication 1 --> 23-28 mins!
4-node, Replication 4 --> ~20 mins
6-node, Replication 2 --> ~15 mins

I used 4 mappers/tasktracker and 1 reducer. I think that can be
tweaked for further improvement.
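
For reference, a rough sketch of a driver for this kind of map-only, text-to-HBase upload (2.13 GB in about 20 minutes works out to roughly 1.8 MB/s of raw input). The table name, paths and the mapper class are placeholders, and the heavier alternative hinted at above, writing HFiles and bulk-loading them, would bypass this write path entirely:

    // Hypothetical driver for a map-only text -> HBase upload via TableOutputFormat.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class UploadDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");  // placeholder table

            Job job = new Job(conf, "text-to-hbase-upload");
            job.setJarByClass(UploadDriver.class);
            // Any Mapper<LongWritable, Text, ImmutableBytesWritable, Put> works;
            // the timed mapper sketched earlier in the thread is one example.
            job.setMapperClass(TimedUploadMapper.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TableOutputFormat.class);
            job.setOutputKeyClass(ImmutableBytesWritable.class);
            job.setOutputValueClass(Put.class);
            job.setNumReduceTasks(0);                             // map-only job
            FileInputFormat.addInputPath(job, new Path(args[0])); // input text in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
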

I hope those stats will be useful for other newbies as well. Like
Jonathan said above, it seems HBase needs more nodes to really start
scaling. Fewer than 4 nodes actually degrades performance.
Thanks for all the help, guys!

Thanks and Regards,
Hari

On Tue, Nov 2, 2010 at 6:01 PM, Hari Shankar <sh...@gmail.com> wrote:
> Hi all,
>
>      I ran the TestDFSIO job on my cluster and thought I'd append it
> here in case it is of any help:
>
> 10/11/02 17:53:56 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> 10/11/02 17:53:56 INFO mapred.FileInputFormat:            Date & time:
> Tue Nov 02 17:53:56 IST 2010
> 10/11/02 17:53:56 INFO mapred.FileInputFormat:        Number of files: 10
> 10/11/02 17:53:56 INFO mapred.FileInputFormat: Total MBytes processed: 10000
> 10/11/02 17:53:56 INFO mapred.FileInputFormat:      Throughput mb/sec:
> 1.2372449326777915
> 10/11/02 17:53:56 INFO mapred.FileInputFormat: Average IO rate mb/sec:
> 1.2381720542907715
> 10/11/02 17:53:56 INFO mapred.FileInputFormat:  IO rate std deviation:
> 0.03402313342081011
> 10/11/02 17:53:56 INFO mapred.FileInputFormat:     Test exec time sec: 866.931
> 10/11/02 17:53:56 INFO mapred.FileInputFormat:
>
> 10/11/02 17:59:35 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : read
> 10/11/02 17:59:35 INFO mapred.FileInputFormat:            Date & time:
> Tue Nov 02 17:59:35 IST 2010
> 10/11/02 17:59:35 INFO mapred.FileInputFormat:        Number of files: 10
> 10/11/02 17:59:35 INFO mapred.FileInputFormat: Total MBytes processed: 10000
> 10/11/02 17:59:35 INFO mapred.FileInputFormat:      Throughput mb/sec:
> 22.776708537849196
> 10/11/02 17:59:35 INFO mapred.FileInputFormat: Average IO rate mb/sec:
> 28.383480072021484
> 10/11/02 17:59:35 INFO mapred.FileInputFormat:  IO rate std deviation:
> 12.521607590777203
> 10/11/02 17:59:35 INFO mapred.FileInputFormat:     Test exec time sec: 108.735
> 10/11/02 17:59:35 INFO mapred.FileInputFormat:
>
> For a 3 node cluster, is this good/bad/ugly..? Where can I find data
> to compare my cluster regarding such parameters?
>
> Thanks,
> Hari
>
>
>
> On Sun, Oct 31, 2010 at 11:59 PM, Hari Shankar <sh...@gmail.com> wrote:
>> Thanks guys for the replies, and very sorry for the late reply. We are
>> quite new to linux environment... our production servers are currently
>> running on windows and our linux sysadmin is yet to arrive. So please
>> forgive my ignorance regarding linux tools. Very little prior
>> experience in linux. All our 3 nodes are running on different linux
>> distros - one on ubuntu server 10.10, one on CentOS and one on
>> Ubuntu-desktop 10.04. All have the same directory structure and same
>> versions of hadoop, hbase and java though. Let me know if you think
>> this could be an issue. Basically we wanted to evaluate all three
>> distros at the same time as well. I hope that plan didn't backfire.
>>
>> Back to the problem at hand, here are the iptraf, htop and iostat reports:
>>
>> iptraf snapshot --master
>>
>> Total rates:
>> 165424.2 kbits/s
>> 3800 packets/s
>>
>> Incoming:
>> 109415.3 kbits/s
>> 3007.4 packets/s
>>
>> iptraf snapshot --slave01
>>
>> Total rates:
>> 102024 kbits/s
>> 3128 packets/s
>>
>> Incoming:
>> 48755.9 kbits/s
>> 1784 packets/s
>>
>> iostat --master
>>
>> Linux 2.6.32-21-generic (hadoop1)       Sunday 31 October 2010  _x86_64_        (4 CPU)
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>                    0.54    0.01      0.18     0.30         0.00   98.97
>>
>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>> sda               2.43       123.11       412.93   33988462  114000368
>>
>> iostat --slave01
>>
>> Linux 2.6.35-22-server (hadoop2)        Sunday 31 October 2010  _x86_64_        (4 CPU)
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>                   0.77    0.00    0.29    0.18    0.00   98.77
>>
>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>> sda               3.90       277.19      1228.97  245515598 1088525808
>>
>>
>> iostat --slave02
>>
>> Linux 2.6.18-194.11.1.el5 (slave)       10/31/2010
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>                  0.54    0.00             0.29    0.80    0.00   98.37
>>
>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>> sda               6.57       302.09      1652.06  321914364 1760497088
>> sda1              0.00         0.00         0.00       2458         88
>> sda2              6.57       302.08      1652.06  321911602 1760497000
>> dm-0            209.33       302.08      1652.06  321910322 1760496272
>> dm-1              0.00         0.00         0.00        896        728
>>
>> htop --master
>> http://imgur.com/3zTu7
>>
>> htop --slave01
>> http://imgur.com/5HeyF
>>
>> htop --slave02
>> http://imgur.com/lHin7
>>
>> I hope these are the reports that you were referring to. Please let me
>> know otherwise. Also, is there an easier command-line way of fetching the
>> iptraf and htop reports? The master is running Ubuntu desktop, slave01
>> runs Ubuntu server and slave02 runs CentOS.
>> Some more facts that I have noticed:
>> - I ran the job just now on the cluster after reformatting the
>> namenode and it took only 1 hr 15 mins instead of the usual 2 hrs,
>> though still slower than the single node config (30-40 mins). Can it
>> be that it is faster right after a namenode format?
>> - The time set on one of the slaves was incorrect and it lagged by 4
>> hrs compared to the other two machines. I corrected the time before
>> formatting the namenode this time. I wonder if that could have an
>> impact.
>> - I have ZK running on all 3 machines. Shouldn't it work fine if I
>> just set up ZK on one of the nodes? In that case, I get a weird error:
>> could not connect to port 0::0::0::0::....:2181 or something of that
>> sort. I'll post the full error next time I see it.
>> - The CentOS machine (slave02) seems to use a lot more CPU than the
>> other 2 machines on average. CPU usage on CentOS hovers around 50-60%
>> mostly, whereas it is more like 30-40% on the other 2 machines (ref.
>> htop screenshots above).
>> - On a single-node configuration, moving from a 4 GB-RAM dual-core
>> laptop to an 8 GB quad-core machine gives a 1.8x performance
>> improvement.
>> - Increasing the child task heap size from the default 200 MB to 768
>> MB improved performance on both single and multi node clusters by 100%
>> (2x improvement). But going beyond 768 MB doesn't seem to have much
>> impact.
>>
>> Michael and Jonathan, I think I have covered most of the info you guys
>> had asked for above as well. It doesn't seem to be swapping, and yes,
>> currently we are running all those processes on the master, and all
>> processes minus the namenode, secondary namenode and JT on the slaves. But
>> we run all those processes on a single machine in the single-node case
>> as well, right? So if RAM/swap were the culprit, shouldn't it affect
>> single-node config more?
>>
>> Do let me know if anything is missing or you think more info would
>> help. Many thanks for your time and patience. :)
>>
>> Thanks,
>> Hari
>>
>> On Fri, Oct 29, 2010 at 9:51 PM, Jonathan Gray <jg...@facebook.com> wrote:
>>> Going from pseudo-distributed mode to a 3 node setup is definitely not "scaling" in a real way and I would expect performance degradation.  Most especially when you're also running at replication factor 3 and in a setup where the master node is also acting as a slave node and MR task node.
>>>
>>> You're adding an entirely new layer (HDFS) which will always cause increased latency/decreased throughput, and then you're running on 3 nodes with a replication factor of 3.  So now every write is going to all three nodes, via HDFS, rather than a single node straight to the FS.
>>>
>>> You said that "all parts should ideally be available on all nodes", but this is a write test?  So that's a bad thing not a good thing.
>>>
>>> I would expect about a 50% slowdown but you're seeing more like 75% slowdown.  Not so out of the ordinary still.  Stuffing a NN, DN, JT, TT, HMaster, and RS onto a single node is not a great idea.  And then you're running 4 simultaneous tasks on a 4 core machine (along with these 6 other processes in the case of the master node).
>>>
>>> How many disks do each of your nodes have?
>>>
>>> If you really want to "scale" HBase, you're going to need more nodes.  I've seen some success at a 5 node level but generally 10 nodes and up is when HBase does well (and replication 3 makes sense).
>>>
>>> JG
>>>
>>>
>>>> -----Original Message-----
>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>> Sent: Friday, October 29, 2010 8:03 AM
>>>> To: user@hbase.apache.org
>>>> Subject: RE: HBase not scaling well
>>>>
>>>>
>>>>
>>>> I'd actually take a step back and ask what Hari is trying to do?
>>>>
>>>> It's difficult to figure out what the problem is when the OP says I've
>>>> got code that works in individual pseudo mode, but not in an actual
>>>> cluster.
>>>> It would be nice to know version(s), configuration... 3 nodes... are
>>>> they running ZK on the same machines that they are running Region
>>>> Servers... Are they swapping? 8GB of memory can disappear quickly...
>>>>
>>>> Lots of questions...
>>>>
>>>>
>>>> > From: clehene@adobe.com
>>>> > To: user@hbase.apache.org
>>>> > Date: Fri, 29 Oct 2010 09:05:28 +0100
>>>> > Subject: Re: HBase not scaling well
>>>> >
>>>> > Hi Hari,
>>>> >
>>>> > Could you do some realtime monitoring (htop, iptraf, iostat) and
>>>> report the results? Also you could add some timers to the map-reduce
>>>> operations: measure average operations times to figure out what's
>>>> taking so long.
>>>> >
>>>> > Cosmin
>>>> > On Oct 29, 2010, at 9:55 AM, Hari Shankar wrote:
>>>> >
>>>> > > Hi,
>>>> > >
>>>> > >     We are currently doing a POC for HBase in our system. We have
>>>> > > written a bulk upload job to upload our data from a text file into
>>>> > > HBase. We are using a 3-node cluster, one master which also works
>>>> as
>>>> > > slave (running as namenode, jobtracker, HMaster, datanode,
>>>> > > tasktracker, HQuorumpeer and  HRegionServer) and 2 slaves
>>>> (datanode,
>>>> > > tasktracker, HQuorumpeer and  HRegionServer running). The problem
>>>> is
>>>> > > that we are getting lower performance from the distributed cluster than
>>>> > > what we were getting from the single-node pseudo-distributed setup. The
>>>> > > upload is taking about 30  minutes on an individual machine,
>>>> whereas
>>>> > > it is taking 2 hrs on the cluster. We have replication set to 3, so
>>>> > > all parts should ideally be available on all nodes, so we doubt if
>>>> the
>>>> > > problem is network latency. scp of files between nodes gives a
>>>> speed
>>>> > > of about 12 MB/s, which I believe should be good enough for this to
>>>> > > function. Please correct me if I am wrong here. The nodes are all 4
>>>> > > core machines with 8 GB RAM.  We are spawning 4 simultaneous map
>>>> tasks
>>>> > > on each node, and the job does not have any reduce phase. Any help
>>>> is
>>>> > > greatly appreciated.
>>>> > >
>>>> > > Thanks,
>>>> > > Hari Shankar
>>>> >
>>>>
>>>
>>
>