Posted to user@hbase.apache.org by Stanley Xu <we...@gmail.com> on 2011/05/14 13:39:50 UTC

Will different Bcast address impact the communication between the cluster?

Dear all,

We have been running into a problem with HBase since a network update. Basically,
3-4 hours after cluster startup, some of the RegionServers try to read
data from a block that has already been deleted.

If we restart the cluster, the problem goes away and no data is
missing.

A detailed description of the problem can be found at
http://search-hadoop.com/m/ZpgJ623GoyU1/.META.+inconsistency&subj=The+META+data+inconsistency+issue

I have just found some suspicious settings in the network configuration of our
cluster: some of the nodes have a different broadcast address and netmask
than the others. For example, in the output below, hadoopsh11092
uses Bcast 10.255.255.255 with Mask 255.0.0.0, while hadoopsh11103 uses Bcast
10.0.2.255 with Mask 255.255.255.0:

hadoopsh11092
eth0      Link encap:Ethernet  HWaddr 00:A0:D1:EE:C1:7C
          inet addr:10.0.2.19  Bcast:10.255.255.255  Mask:255.0.0.0
          inet6 addr: fe80::2a0:d1ff:feee:c17c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1864321949 errors:0 dropped:1465 overruns:0 frame:0
          TX packets:1867202791 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1811900116811 (1.6 TiB)  TX bytes:1879509303203 (1.7 TiB)
          Memory:face0000-fad00000


hadoopsh11103
eth0      Link encap:Ethernet  HWaddr 00:A0:D1:EE:AE:C4
          inet addr:10.0.2.30  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::2a0:d1ff:feee:aec4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1726779928 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1716762766 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1804202744690 (1.6 TiB)  TX bytes:1824085255121 (1.6 TiB)
          Memory:face0000-fad00000
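The mismatch can be confirmed from the addresses alone. A quick sketch with
Python's ipaddress module, using the values from the ifconfig output above:

```python
import ipaddress

# Interface settings copied from the ifconfig output above.
nodes = {
    "hadoopsh11092": ("10.0.2.19", "255.0.0.0"),
    "hadoopsh11103": ("10.0.2.30", "255.255.255.0"),
}

for host, (addr, mask) in nodes.items():
    net = ipaddress.ip_interface(f"{addr}/{mask}").network
    print(f"{host}: network {net}, broadcast {net.broadcast_address}")
# hadoopsh11092: network 10.0.0.0/8, broadcast 10.255.255.255
# hadoopsh11103: network 10.0.2.0/24, broadcast 10.0.2.255
```

So the two hosts genuinely disagree about the size of the subnet they share,
which matches the Bcast values ifconfig reports.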

But even with these settings, the cluster starts up successfully and works
fine for a while; the problem only appears after 3-4 hours. I can also
SSH to the different machines by their host names without any
issue.
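One possible reason unicast traffic such as SSH still works despite the mask
mismatch: under its own mask, each host still considers the other's address to
be on the local network, so ordinary point-to-point TCP is unaffected and only
the broadcast addresses differ. A small sketch of that check (my reasoning,
not from the thread):

```python
import ipaddress

# Each host's view of its own subnet (from the ifconfig output above).
a = ipaddress.ip_interface("10.0.2.19/255.0.0.0")      # hadoopsh11092
b = ipaddress.ip_interface("10.0.2.30/255.255.255.0")  # hadoopsh11103

# Does each host consider the other's address part of its local network?
print(b.ip in a.network)  # True: 10.0.2.30 is inside 10.0.0.0/8
print(a.ip in b.network)  # True: 10.0.2.19 is inside 10.0.2.0/24
```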

I know that ZooKeeper does some kind of broadcast during communication. I am
wondering whether our settings should still work, or whether this could be the
root cause of our problem.

Thanks in advance.

Best wishes,
Stanley Xu

Re: Will different Bcast address impact the communication between the cluster?

Posted by Stanley Xu <we...@gmail.com>.
Another difference is that, during the network upgrade, we gave
each cluster node two network cards: one on 192.168.11.* and one
on 10.0.2.*. We also found that on some of the machines (5 out of 37),
ip_forward is turned off.


I know almost nothing about networking, so this might be a stupid question:

if ip_forward is not turned on, could that also
affect HBase communication?
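For reference, on Linux this flag lives in /proc/sys/net/ipv4/ip_forward
("0" = off, "1" = on), so it is easy to survey across the cluster. A minimal
sketch of tallying such a survey, using made-up hostnames and values rather
than real output from our nodes:

```python
# Illustrative only: pretend we collected each node's ip_forward value,
# e.g. via "cat /proc/sys/net/ipv4/ip_forward" over SSH.
collected = {
    "node-a": "1\n",
    "node-b": "0\n",
    "node-c": "1\n",
}

off = sorted(host for host, raw in collected.items() if raw.strip() == "0")
print(f"{len(off)} of {len(collected)} nodes have ip_forward off: {off}")
# 1 of 3 nodes have ip_forward off: ['node-b']
```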


Thanks.


On Sat, May 14, 2011 at 7:39 PM, Stanley Xu <we...@gmail.com> wrote:

> [original message quoted above, trimmed]