Posted to issues@trafodion.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/09/16 00:07:00 UTC
[jira] [Commented] (TRAFODION-2746) Monitor exhibits memory
corruption in large cluster configuration > 30 nodes
[ https://issues.apache.org/jira/browse/TRAFODION-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168716#comment-16168716 ]
ASF GitHub Bot commented on TRAFODION-2746:
-------------------------------------------
GitHub user zcorrea opened a pull request:
https://github.com/apache/incubator-trafodion/pull/1234
[TRAFODION-2746] Fixed various problem detected in large clusters (> 30)
The problems were:
1. A segmentation violation occurred during the Integration phase, when the new
monitor is establishing the socket communication paths between itself and
the existing monitors.
a. Information is exchanged between the master (creator) monitor and the
slave (new) monitor process, which tells the new monitor which nodes'
monitor processes make up the existing cluster instance. During these
exchanges, in CCluster::ReceiveSock(), one of the messages was large
enough to require chunking, and the logic that tracked the number of
bytes received computed it incorrectly, which resulted in a write past
the end of the receive buffer.
2. A second segmentation violation was due to a buffer overwrite during the
Joining (revive) phase.
a. In requeue.cxx, when creating the buffer in the master (creator) monitor
that is populated with the cluster state information to be sent to the
slave (new) monitor process, the size calculation did not properly
account for the number of logical and physical nodes. As a result,
populating the buffer wrote past the end of the allocation.
3. A third problem was also noted: one of the monitors would remain in
the Joining state and never come out of it.
a. The problem was in the order of the logic around the call to
CCluster::ResetIntegratingPNid(), which triggers
CCommAccept::commAcceptorSock() to accept another new node to
integrate. ResetIntegratingPNid() was invoked before the creator flag
was reset. Due to kernel scheduling, the creator flag was sometimes
reset after another monitor had started the Integration phase, which
broke the node integration protocol by terminating it too early. The
new monitor would then stay in the Joining state forever.
4. The last segmentation violation was due to a stderr buffer overwrite in
CRedirectStderr::handleOutput(), where the size returned by snprintf()
was used to terminate the buffer. When the stderr data was >= 4096 bytes,
which is the size of the buffer, the terminator was written past the end
of the buffer.
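
For illustration only, here is a minimal sketch of the two bounds checks that
problems 1 and 4 come down to. It is not the code from this pull request:
receiveAll(), copyBounded() and the readChunk callback are hypothetical names,
and readChunk simply stands in for a recv() call on the monitor socket. The
first helper tracks the running byte count of a chunked receive so that every
read lands at the current offset and asks only for the bytes still missing.
The second clamps the value returned by snprintf(), which is the length the
output would have had without truncation, before using it as an index.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <functional>

// Hypothetical helper (not the Trafodion implementation): read exactly `want`
// bytes into `buf`, whose capacity is `cap`, in as many chunks as needed.
// `readChunk(dst, max)` stands in for recv(): it returns the number of bytes
// stored (> 0), 0 on end of stream, or a negative value on error.
// Returns the number of bytes stored, or -1 on error or if the message
// cannot fit in the buffer.
long receiveAll(char *buf, std::size_t cap, std::size_t want,
                const std::function<long(char *, std::size_t)> &readChunk)
{
    if (want > cap) return -1;               // refuse rather than overrun
    std::size_t received = 0;
    while (received < want) {
        std::size_t remaining = want - received;
        // Write at the current offset and request only the missing bytes.
        long n = readChunk(buf + received, remaining);
        if (n < 0) return -1;                // read error
        if (n == 0) break;                   // peer closed early
        if (static_cast<std::size_t>(n) > remaining) return -1;
        received += static_cast<std::size_t>(n);
    }
    return static_cast<long>(received);
}

// Hypothetical helper: copy `data` into `buf` with snprintf() and return the
// number of characters actually stored, excluding the terminating NUL.
std::size_t copyBounded(char *buf, std::size_t cap, const char *data)
{
    if (cap == 0) return 0;
    int n = std::snprintf(buf, cap, "%s", data);
    if (n < 0) {                             // encoding error
        buf[0] = '\0';
        return 0;
    }
    std::size_t len = static_cast<std::size_t>(n);
    if (len >= cap) len = cap - 1;           // output was truncated
    buf[len] = '\0';                         // index is always inside buf
    return len;
}
```

With an 8-byte buffer, copyBounded(buf, sizeof(buf), "0123456789") stores
"0123456" and returns 7, whereas indexing the buffer with the raw snprintf()
result of 10 would write outside it.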
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zcorrea/incubator-trafodion TRAFODION-2746
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/1234.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1234
----
commit 19555630d5c0d63e8a8ea1e02f92545da983cb35
Author: Zalo Correa <za...@esgyn.com>
Date: 2017-09-16T00:02:48Z
[TRAFODION-2746] Fixed various problem detected in large clusters (> 30)
----
> Monitor exhibits memory corruption in large cluster configuration > 30 nodes
> ----------------------------------------------------------------------------
>
> Key: TRAFODION-2746
> URL: https://issues.apache.org/jira/browse/TRAFODION-2746
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.3-incubating
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
> Fix For: 2.3-incubating
>
>
> Found the following problems in the monitor when trying to bring up 120 nodes:
> 1. A segmentation violation occurred during the Integration phase, when the new monitor is establishing the socket communication paths between itself and the existing monitors.
> 2. A second segmentation violation was due to a buffer overwrite during the Joining (revive) phase.
> 3. One of the monitors would remain in the Joining state and never come out of it.
> 4. Stderr buffer overwrite in CRedirectStderr::handleOutput()