Posted to issues@flink.apache.org by zentol <gi...@git.apache.org> on 2018/03/07 09:59:05 UTC

[GitHub] flink pull request #5652: [hotfix][tests] Do not use singleActorSystem in Lo...

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/5652

    [hotfix][tests] Do not use singleActorSystem in LocalFlinkMiniCluster

    ## What is the purpose of the change
    
    The legacy cluster started in `MiniClusterResource` used a single actor system, which rendered the returned `ClusterClient` unusable.
    
    This change will unfortunately make tests take longer, but I don't know of another way to fix it.
    
    Previously, every access to the client failed with the exception below:
    ```
    org.apache.flink.client.program.ProgramInvocationException: Failed to retrieve the JobManager gateway.
        at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:513)
        at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:113)
    Caused by: org.apache.flink.util.FlinkException: Could not find out our own hostname by connecting to the leading JobManager. Please make sure that the Flink cluster has been started.
        at org.apache.flink.client.program.ClusterClient$LazyActorSystemLoader.get(ClusterClient.java:248)
        at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:923)
        at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:511)
        ... 30 more
    Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not find the connecting address by connecting to the current leader.
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.findConnectingAddress(LeaderRetrievalUtils.java:164)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.findConnectingAddress(LeaderRetrievalUtils.java:145)
        at org.apache.flink.client.program.ClusterClient$LazyActorSystemLoader.get(ClusterClient.java:244)
        ... 32 more
    Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the connecting address to the current leader with the akka URL akka://flink/user/jobmanager_1.
        at org.apache.flink.runtime.net.ConnectionUtils$LeaderConnectingAddressListener.findConnectingAddress(ConnectionUtils.java:472)
        at org.apache.flink.runtime.net.ConnectionUtils$LeaderConnectingAddressListener.findConnectingAddress(ConnectionUtils.java:361)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.findConnectingAddress(LeaderRetrievalUtils.java:162)
        ... 34 more
    Caused by: java.lang.Exception: Could not retrieve InetSocketAddress from Akka URL akka://flink/user/jobmanager_1
        at org.apache.flink.runtime.akka.AkkaUtils$.getInetSocketAddressFromAkkaURL(AkkaUtils.scala:709)
        at org.apache.flink.runtime.akka.AkkaUtils.getInetSocketAddressFromAkkaURL(AkkaUtils.scala)
        at org.apache.flink.runtime.net.ConnectionUtils$LeaderConnectingAddressListener.findConnectingAddress(ConnectionUtils.java:392)
        ... 36 more
    ```
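
The innermost cause in the trace above points at the URL `akka://flink/user/jobmanager_1`, which carries no host or port. The sketch below is an illustration only and is not Flink's `AkkaUtils`; the class `AkkaAddressSketch`, the helper `remoteAddress`, and the example remote URL (including port 6123) are assumptions made for this example. It shows why a local actor URL cannot yield a socket address, whereas a remote-style URL of the form `akka.tcp://system@host:port/...` can.

```java
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.Optional;

// Illustrative only: NOT Flink's AkkaUtils. Names here are hypothetical.
final class AkkaAddressSketch {

    // Returns the remote address encoded in an Akka-style URL, if there is one.
    static Optional<InetSocketAddress> remoteAddress(String akkaUrl) {
        URI uri = URI.create(akkaUrl);
        // Remote URLs look like akka.tcp://system@host:port/...;
        // local URLs omit the "@host:port" part entirely.
        if (uri.getUserInfo() == null || uri.getHost() == null || uri.getPort() == -1) {
            return Optional.empty();
        }
        return Optional.of(InetSocketAddress.createUnresolved(uri.getHost(), uri.getPort()));
    }

    public static void main(String[] args) {
        // Local URL as seen in the stack trace: no address can be extracted.
        System.out.println(remoteAddress("akka://flink/user/jobmanager_1").isPresent());
        // Assumed remote-style URL: host and port are available.
        System.out.println(remoteAddress("akka.tcp://flink@127.0.0.1:6123/user/jobmanager").isPresent());
    }
}
```

With a single actor system there is presumably no remoting layer, so only the first kind of URL is ever published, and a client that needs a connecting address has nothing to work with.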


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink hotfix_single

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5652.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5652
    
----
commit 6a105cbb194b87dec98224b985ee5ceb9239d492
Author: zentol <ch...@...>
Date:   2018-03-05T12:45:33Z

    [hotfix][tests] Do not use singleActorSystem in LocalFlinkMiniCluster
    
    Using a singleActorSystem rendered the returned client unusable.

----


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    All legacy tests going through the `MiniClusterResource` will take longer. I don't know by how much, but we now have to start multiple actor systems and the JM<->TM communication is no longer local.


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    The alternative would be to make the `ClusterClient` functionality optional and force tests to explicitly enable it.


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    Will all tests take longer, or only those that use the `ClusterClient`? How large an increase in time are we talking about?



---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    Would it work to have a "Flink Service" resource interface to which you can submit jobs?
    
    It may be backed by a cluster client or directly by the mini cluster, which executes jobs directly. Having the shared interface (across flip6 and legacy) based on the cluster client seems like the wrong common abstraction.
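
One way to picture the suggestion above is a small interface with two interchangeable backends. This is a rough sketch under stated assumptions: `JobService`, `DirectService`, `ClientBackedService`, and `JobServiceDemo` are hypothetical names that do not exist in Flink, and jobs are reduced to plain strings so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "Flink Service" idea; none of these types exist in Flink.
interface JobService {
    // Submits a job (identified here only by name) and returns an opaque job id.
    String submit(String jobName);
}

// Backend that runs jobs directly, standing in for an in-process mini cluster.
final class DirectService implements JobService {
    final List<String> executed = new ArrayList<>();

    @Override
    public String submit(String jobName) {
        executed.add(jobName);
        return "direct-" + executed.size();
    }
}

// Backend that forwards to a client stand-in, as a legacy cluster would require.
final class ClientBackedService implements JobService {
    final List<String> forwarded = new ArrayList<>();

    @Override
    public String submit(String jobName) {
        forwarded.add(jobName);
        return "client-" + forwarded.size();
    }
}

final class JobServiceDemo {
    // A test written against JobService does not care which backend it gets.
    static String runTest(JobService service) {
        return service.submit("wordcount");
    }

    public static void main(String[] args) {
        System.out.println(runTest(new DirectService()));
        System.out.println(runTest(new ClientBackedService()));
    }
}
```

The point of the design is that test code depends only on the narrow interface, so the choice between a client-backed and a direct backend stays inside the test resource.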


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    @zentol To unblock this, I'd propose to add another constructor to `MiniClusterResource` that takes an `enableClusterClient` parameter. Only if that is true do we start in multi-actor-system mode. WDYT?
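
The proposed opt-in could look roughly like the sketch below. It is an assumption-laden illustration, not the real `MiniClusterResource`: the class name `MiniClusterResourceSketch` and the method `useSingleActorSystem()` are hypothetical, and only the `enableClusterClient` flag comes from the comment above.

```java
// Hypothetical sketch; the real MiniClusterResource has a different shape.
final class MiniClusterResourceSketch {
    private final boolean enableClusterClient;

    // Default constructor: no ClusterClient requested, so keep the faster
    // single-actor-system mode.
    MiniClusterResourceSketch() {
        this(false);
    }

    // Opt-in constructor: tests that need a working ClusterClient pass true.
    MiniClusterResourceSketch(boolean enableClusterClient) {
        this.enableClusterClient = enableClusterClient;
    }

    // Multiple actor systems are started only when the client was requested.
    boolean useSingleActorSystem() {
        return !enableClusterClient;
    }

    public static void main(String[] args) {
        System.out.println(new MiniClusterResourceSketch().useSingleActorSystem());
        System.out.println(new MiniClusterResourceSketch(true).useSingleActorSystem());
    }
}
```

Under this arrangement only the tests that explicitly ask for the client would pay the cost of the slower multi-actor-system startup.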


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    Could change `MiniClusterResource` to expect a `needsClusterClient()` parameter or whatnot and normally start in single-actor-system mode. That's probably what you had in mind ... 😅 


---

[GitHub] flink pull request #5652: [hotfix][tests] Do not use singleActorSystem in Lo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/5652


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    @StephanEwen I think that's what the interface already does. For example, `MiniClusterClient.submitJob()` does a `miniCluster.runDetached(jobGraph)`, and `MiniClusterClient.cancel()` does `miniCluster.cancelJob(jobId)`. The problem is that the legacy cluster does not have methods for those things, namely "cancel", "get job status", "get accumulators", and "savepoint". All existing ITCases use custom Akka communication with the testing cluster. We can either add methods for all that to the legacy mini cluster (that would probably also use Akka) or use the `StandaloneClusterClient`, which also uses Akka. But for those Akka messages to work we can't run it in single-actor-system mode. WDYT?


---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    We will see when we get the results from Travis for this one, right?



---

[GitHub] flink issue #5652: [hotfix][tests] Do not use singleActorSystem in LocalFlin...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/5652
  
    yup. But one profile is already scratching the 50m limit as is :/


---