Posted to dev@hawq.apache.org by Gagan Brahmi <ga...@gmail.com> on 2016/05/03 17:30:37 UTC

Master Segment Communication Issues

Hi All,

Has anyone seen behavior where segments are not able to communicate
with the master in HAWQ 2.0?

In a single-node setup I don't see any problem with segment and
master communication. The problem appears only when two or three
machines are involved in the HAWQ cluster.

The segment shows up in gp_segment_configuration. It is reported as up
for a few seconds and is then marked down. Once you execute any query,
the segment is no longer listed in gp_segment_configuration.

Nothing can be found in gp_configuration or gp_configuration_history.
"hawq state" reports "failures at master" for the segments.

I found a JIRA describing very similar behavior (except for the core
dumps, which I haven't seen yet since I am not able to run any queries).
The JIRA in question is https://issues.apache.org/jira/browse/HAWQ-323

That issue was closed with duplicate IP addresses given as the cause. I
am trying to understand whether that could be the case here.

There is no firewall between the segments. Nothing is blocking ports
40000, 5432, 5437 or 5438. The segments start up fine, and psql to the
segments on port 40000 works fine as well.
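
For anyone who wants to repeat the port check from another node, here
is a minimal sketch in Python. It only tests whether a TCP connection
can be opened; it says nothing about UDP traffic or about what HAWQ
does once connected. The function name check_ports is made up for this
example, and the host name in the usage note is a placeholder.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool} telling whether a TCP connection to host:port opens."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host and attempts a TCP connect.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution errors.
            results[port] = False
    return results
```

Calling check_ports("segment-host", [40000, 5432, 5437, 5438]) from the
master, with "segment-host" replaced by a real segment host name, gives
one True/False entry per port.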

There are no errors in the segment, master, or startup logs.

I tried integrating HAWQ with YARN and also using its own resource
manager, but saw the same behavior. I also set the segment heartbeat
interval to 10 seconds (hawq_rm_segment_heartbeat_interval = 10 in
hawq-site.xml), but the behavior did not change.
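
To confirm that the value in hawq-site.xml on a given node is the one
intended, something like the following could be used. This is a sketch
that assumes hawq-site.xml follows the Hadoop-style layout of
<configuration> containing <property> elements with <name> and <value>
children; read_property is a hypothetical helper, not part of HAWQ.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the <value> text for the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        # Compare against the <name> child; missing elements fall back to "".
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None
```

Reading the file contents on each node and passing them to
read_property(text, "hawq_rm_segment_heartbeat_interval") shows whether
the master and the segments agree on the setting.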

Am I missing anything here? Has anyone found similar behavior before?


Regards,
Gagan Brahmi

Re: Master Segment Communication Issues

Posted by Gagan Brahmi <ga...@gmail.com>.
Thank you Vineet.

I figured this problem out in the virtual environment. On the physical
nodes, however, the problem seems to have been caused by an alias on
the loopback interface. The IP address causing the problem was
127.0.0.2.

This brings me to another question: is there any way to configure HAWQ
to use a specific IP address? In this case, can we ask HAWQ to skip
127.0.0.2?
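
This does not answer the configuration question, but a loopback alias
like this can at least be spotted before starting the cluster. The
sketch below is Python and assumes the addresses come from the output
of "ip -o addr show", where each address follows an "inet" or "inet6"
token. Both function names are invented for this example.

```python
import ipaddress


def addresses_from_ip_output(text):
    """Extract addresses from `ip -o addr show` style text (token after inet/inet6)."""
    found = []
    for line in text.splitlines():
        tokens = line.split()
        for i, tok in enumerate(tokens[:-1]):
            if tok in ("inet", "inet6"):
                # The next token looks like "127.0.0.2/8"; keep only the address.
                found.append(ipaddress.ip_interface(tokens[i + 1]).ip)
    return found


def extra_loopback_addresses(addresses):
    """Return loopback addresses other than the standard 127.0.0.1 and ::1."""
    standard = {ipaddress.ip_address("127.0.0.1"), ipaddress.ip_address("::1")}
    return [str(a) for a in addresses if a.is_loopback and a not in standard]
```

Any address returned by extra_loopback_addresses on a node is a
candidate for the kind of alias described above.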


Regards,
Gagan Brahmi

On Tue, May 3, 2016 at 11:30 AM, Vineet Goel <vg...@pivotal.io> wrote:
> Are you using vagrant or VMs, or are these physical machines?
>
> Sometimes the problem results from a misleading IP address configuration on the network card of a virtual machine. Check whether two segments have the same IP address on eth0.
> Each VM must have a different IP address on eth0.
>
> Thanks
> -Vineet

Re: Master Segment Communication Issues

Posted by Vineet Goel <vg...@pivotal.io>.
Are you using vagrant or VMs, or are these physical machines?

Sometimes the problem results from a misleading IP address configuration on the network card of a virtual machine. Check whether two segments have the same IP address on eth0.
Each VM must have a different IP address on eth0.
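
One way to run this check across a cluster is to collect the addresses
each host reports and look for any that appear on more than one host.
The Python sketch below assumes the per-host address lists have already
been gathered by some other means; shared_addresses and the host names
in the usage note are hypothetical.

```python
from collections import defaultdict


def shared_addresses(host_to_addresses, ignore=("127.0.0.1", "::1")):
    """Map each address seen on more than one host to the sorted hosts reporting it."""
    owners = defaultdict(set)
    for host, addresses in host_to_addresses.items():
        for addr in addresses:
            # The standard loopback addresses exist on every host, so skip them.
            if addr not in ignore:
                owners[addr].add(host)
    return {addr: sorted(hosts) for addr, hosts in owners.items() if len(hosts) > 1}
```

For example, passing {"seg1": ["10.0.2.15"], "seg2": ["10.0.2.15"]}
returns {"10.0.2.15": ["seg1", "seg2"]}, while hosts with distinct
addresses give an empty result.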

Thanks
-Vineet




