Posted to issues@trafodion.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/09/16 00:07:00 UTC
[jira] [Commented] (TRAFODION-2746) Monitor exhibits memory corruption in large cluster configuration > 30 nodes

    [ https://issues.apache.org/jira/browse/TRAFODION-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168716#comment-16168716 ] 

ASF GitHub Bot commented on TRAFODION-2746:
-------------------------------------------

GitHub user zcorrea opened a pull request:

    https://github.com/apache/incubator-trafodion/pull/1234

    [TRAFODION-2746] Fixed various problem detected in large clusters (> 30)

    The problems were:
    
    1. A segmentation violation occurred during the Integration phase, when the new 
       monitor is establishing the socket communication paths between itself and 
       the existing monitors.
       a. Information is exchanged between the master (creator) monitor and the
          slave (new) monitor process, telling the new monitor which nodes'
          monitor processes make up the existing cluster instance. During these
          exchanges, in CCluster::ReceiveSock(), one of the messages was large
          enough to require chunking, and the logic that tracked the number of
          bytes received miscalculated it, which resulted in an overwrite past
          the boundary of the receive buffer.
    2. A second segmentation violation was due to a buffer overwrite during the
       Joining (revive) phase.
       a. In requeue.cxx, when creating the buffer in the master (creator) monitor
          that is populated with the cluster state information to be sent to the
          slave (new) monitor process, the size calculation did not properly
          account for the number of logical and physical nodes. As a result,
          populating the buffer wrote past the end of the allocation.
    3. A third problem was also noted: one of the monitors would remain in
       the Joining state and never come out of it.
       a. The problem was in the order of operations when calling
          CCluster::ResetIntegratingPNid(), which triggers
          CCommAccept::commAcceptorSock() to accept another new node to
          integrate. ResetIntegratingPNid() was invoked before the creator flag
          was reset. Due to kernel scheduling, the creator flag was being reset
          after another monitor had started the Integration phase, which broke
          the node integration protocol by terminating it too early. The new
          monitor would therefore stay in the Joining state forever.
    4. The last segmentation violation was due to a stderr buffer overwrite in
       CRedirectStderr::handleOutput(), where the size returned by snprintf()
       was used to terminate the buffer. When the stderr data was >= 4096 bytes,
       the size of the buffer, the terminator was written past its end.
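
For illustration only, the chunked-receive issue in item 1 can be sketched as below. This is not the code from CCluster::ReceiveSock(); receiveExact, ReadFn and makeChunkedReader are hypothetical names, and the reader callback stands in for a socket recv() call. The point is that both the write offset and the remaining length are derived from one running byte count, so a short read cannot push a later write past the end of the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>

// Hypothetical stand-in for recv(): copies up to `len` bytes into `dst` and
// returns the number delivered (0 = peer closed, negative = error).
using ReadFn = std::function<long(char* dst, std::size_t len)>;

// Reads exactly `want` bytes into `buf` (capacity `cap`), tolerating short reads.
// Returns the bytes received, or -1 on error, early close, or want > cap.
long receiveExact(const ReadFn& readSome, char* buf, std::size_t cap, std::size_t want)
{
    assert(buf != nullptr);
    if (want > cap) return -1;  // never request more than the buffer can hold

    std::size_t received = 0;
    while (received < want) {
        // Offset and remaining length both come from `received`.
        long n = readSome(buf + received, want - received);
        if (n <= 0) return -1;
        received += static_cast<std::size_t>(n);
    }
    return static_cast<long>(received);
}

// Illustrative reader that hands out a fixed message at most `chunk` bytes per call.
ReadFn makeChunkedReader(const char* msg, std::size_t total, std::size_t chunk)
{
    std::size_t pos = 0;
    return [=](char* dst, std::size_t len) mutable -> long {
        std::size_t n = std::min({len, chunk, total - pos});
        std::memcpy(dst, msg + pos, n);
        pos += n;
        return static_cast<long>(n);
    };
}
```

With a 10-byte message delivered 3 bytes at a time, receiveExact() loops four times and fills exactly 10 bytes; asking for more than the buffer capacity is rejected up front.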

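Item 4 depends on a property of snprintf() that is easy to overlook: on truncation it returns the length the full output would have had, not the number of characters stored. A minimal sketch of a clamped terminator follows. boundedFormat is a hypothetical helper, not the code in CRedirectStderr::handleOutput().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Formats `text` into `buf` (capacity `cap`) and returns the number of
// characters actually stored, excluding the terminating NUL.
std::size_t boundedFormat(char* buf, std::size_t cap, const char* text)
{
    assert(buf != nullptr && cap > 0);

    int n = std::snprintf(buf, cap, "%s", text);
    if (n < 0) {  // encoding error: leave an empty string
        buf[0] = '\0';
        return 0;
    }

    // n may be >= cap when the output was truncated, so clamp it before
    // using it as an index into buf.
    std::size_t len = static_cast<std::size_t>(n);
    if (len >= cap) len = cap - 1;
    buf[len] = '\0';
    return len;
}
```

Using the unclamped return value as the index (buf[n] = '\0') writes out of bounds whenever the input is at least as long as the buffer.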

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zcorrea/incubator-trafodion TRAFODION-2746

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-trafodion/pull/1234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1234
    
----
commit 19555630d5c0d63e8a8ea1e02f92545da983cb35
Author: Zalo Correa <za...@esgyn.com>
Date:   2017-09-16T00:02:48Z

    [TRAFODION-2746] Fixed various problem detected in large clusters (> 30)

----


> Monitor exhibits memory corruption in large cluster configuration > 30 nodes
> ----------------------------------------------------------------------------
>
>                 Key: TRAFODION-2746
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2746
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.3-incubating
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>             Fix For: 2.3-incubating
>
>
> Found the following problems in the monitor when trying to bring up 120 nodes:
> 1.	A segmentation violation occurred during the Integration phase, when the new monitor is establishing the socket communication paths between itself and the existing monitors.
> 2.	A second segmentation violation was due to a buffer overwrite during the Joining (revive) phase.
> 3.	One of the monitors would remain in the Joining state and never come out of it.
> 4.	Stderr buffer overwrite in CRedirectStderr::handleOutput()


