You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafodion.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/02/06 20:40:00 UTC
[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

    [ https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354514#comment-16354514 ] 

ASF GitHub Bot commented on TRAFODION-2940:
-------------------------------------------

Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1427#discussion_r166434382
  
    --- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
    @@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
             trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
             jvmShutdownHook = new JVMShutdownHook();
             Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
    -        thrd = new Thread(this);
    -        thrd.start();
    +
    +        ExecutorService executorService = Executors.newFixedThreadPool(1);
    +        CompletionService<Integer> completionService = new ExecutorCompletionService<Integer>(executorService);
    +
    +        while (true) {
    +            completionService.submit(this);
    +            Future<Integer> f = null;
    +            try {
    +                f = completionService.take();
    +                if (f != null) {
    +                    Integer status = f.get();
    +                    if (status <= 0) {
    +                        System.exit(status);
    +                    } else {
    +                        // 35000 * 15mins ~= 1 years
    +                        RetryCounter retryCounter = RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
    +                        while (true) {
    +                            try {
    +                                ZkClient tmpZkc = new ZkClient();
    +                                tmpZkc.connect();
    +                                tmpZkc.close();
    +                                tmpZkc = null;
    +                                LOG.info("Connected to ZooKeeper successful, restart DCS Master.");
    +                                // reset lock
    +                                isLeader = new CountDownLatch(1);
    +                                break;
    --- End diff --
    
    I'm not sure I understand this logic. Do we sit inside one of these method calls during normal processing? Or does tmpZkc.connect() and close() complete immediately? If so, it looks like we just loop around the while loop and do it again, over and over.


> In HA env, one node lose network, when recover, trafci can't use
> ----------------------------------------------------------------
>
>                 Key: TRAFODION-2940
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2940
>             Project: Apache Trafodion
>          Issue Type: Bug
>    Affects Versions: any
>            Reporter: mashengchen
>            Assignee: mashengchen
>            Priority: Major
>             Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, there will have two floating ip, two working dcs master, and trafci can't be use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)