Posted to user@flink.apache.org by Juan Gentile <j....@criteo.com> on 2018/10/31 14:05:27 UTC

1.6 UI issues

Hello!

We are migrating to the latest 1.6 version and all the jobs seem to work fine. However, when we check individual jobs through the web interface, the job information either takes too long to load after clicking on a job or never loads at all.

Has anyone had this issue? Any clues as to why?

Thank you,
Juan

Re: 1.6 UI issues

Posted by Oleksandr Nitavskyi <o....@criteo.com>.

Hi again here,


So I have created two Jira issues: https://issues.apache.org/jira/browse/FLINK-11394 about the UI problem and https://issues.apache.org/jira/browse/FLINK-11396 about the GC pressure in MetricStore. Let's continue the technical discussion there.


A possible workaround for the GC pressure is to use a more predictable GC than G1 with ergonomics. We have switched to Parallel GC for the JM and hope it will be good enough for all our use cases. On the TM side we still prefer G1 because of its latency promises.


Cheers

Oleksandr

________________________________
From: Till Rohrmann <tr...@apache.org>
Sent: Thursday, January 10, 2019 6:27:10 PM
To: Oleksandr Nitavskyi
Cc: user@flink.apache.org; dwysakowicz@apache.org; Jeff Bean; Jérôme Viveret; Juan Gentile
Subject: Re: 1.6 UI issues

Hi Oleksandr,

thanks a lot for the kind wishes and for the detailed investigation.

1. I think if the cluster cannot serve the information within the web.refresh-interval, it would be best to increase it. I quickly looked into the `ExecutionGraphCache`, which is used for storing the `ArchivedExecutionGraph`, and it looks like one could change the logic a bit. What we do at the moment is to invalidate the ExecutionGraph cache entries after the web.refresh-interval and request an update from the cluster. This has the benefit (given that the response is fast) that we see the updated state sooner. Instead, one could also invalidate the old ExecutionGraph cache entry only after the response for the new request has arrived. This would prevent your situation because you would keep the old state as long as the request is in flight. The downside of this approach would be that you might wait another UI refresh interval until you see the results, even if the response is very fast. For that you could open a JIRA issue to discuss it further.

2. The high load caused by the MetricStore is indeed a problem. For that we should also open a JIRA issue to investigate what we could improve here. One thing we should definitely do is to make the fetching interval configurable so that one doesn't have to recompile Flink in order to change it. I actually quickly added it [1,2].
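
To make the alternative from point 1 more concrete, here is a minimal Java sketch of a cache that keeps serving the old entry while the refresh request is in flight and swaps it only once the new response has arrived. This is not Flink's ExecutionGraphCache: the class name StaleWhileRefreshCache, the loader function and the explicit clock argument are hypothetical and exist only for the illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical single-entry cache; an illustration, not Flink's ExecutionGraphCache. */
final class StaleWhileRefreshCache<V> {
    private final long ttlMillis;
    private final Supplier<CompletableFuture<V>> loader;

    private V current;                      // last successfully loaded value, may be stale
    private long loadedAt;                  // time at which the last successful load was started
    private CompletableFuture<V> inFlight;  // pending refresh, or null if none

    StaleWhileRefreshCache(long ttlMillis, Supplier<CompletableFuture<V>> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    /** Returns the cached value and starts at most one refresh once the entry is older than the TTL. */
    synchronized CompletableFuture<V> get(long nowMillis) {
        boolean expired = current == null || nowMillis - loadedAt >= ttlMillis;
        CompletableFuture<V> pending = inFlight;
        if (expired && pending == null) {
            pending = loader.get();
            inFlight = pending;
            CompletableFuture<V> request = pending;
            request.whenComplete((value, error) -> onRefreshDone(request, value, error, nowMillis));
        }
        // The old value is kept and served until the new response has arrived.
        return current != null ? CompletableFuture.completedFuture(current) : pending;
    }

    private synchronized void onRefreshDone(
            CompletableFuture<V> request, V value, Throwable error, long startedAt) {
        if (inFlight == request) {
            inFlight = null;
        }
        if (error == null) {
            current = value;       // swap only after a successful response
            loadedAt = startedAt;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> slowResponse = new CompletableFuture<>();
        StaleWhileRefreshCache<String> cache = new StaleWhileRefreshCache<>(10_000, () -> slowResponse);
        CompletableFuture<String> firstCall = cache.get(0);
        System.out.println("first call done before response? " + firstCall.isDone());
        slowResponse.complete("graph v1");
        System.out.println("after response: " + cache.get(1_000).join());
    }
}
```

With this shape, a slow cluster only delays the moment the entry is replaced; callers never lose the previous state, at the cost of possibly seeing it for one more refresh interval.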

Thanks a lot for your help with debugging the problems!

[1] https://github.com/apache/flink/pull/7459
[2] https://issues.apache.org/jira/browse/FLINK-11300

Cheers,
Till

On Thu, Jan 10, 2019 at 10:08 AM Oleksandr Nitavskyi <o....@criteo.com> wrote:

Hello Till,



First, congratulations to you and the whole Flink community! It is great to see such success and recognition of Apache Flink and your work.



Thanks also for the previous answer and good tips. On our side we have made several more steps in understanding the issue.



So I think we have two related problems in Flink, which can be reproduced in our set up:

  1.  UI issue

There seem to be some routing problems on the Angular side of the Flink UI. Angular refreshes the job state (which is 20 KB in our case) every 10 seconds by default (web.refresh-interval).

[image: image001.png]

[image: image002.png]

If one of the refresh calls takes longer than web.refresh-interval, the next request is made anyway.

[image: image003.png]

After a while the first requests start to complete, but the UI is not rendered correctly in this case.

[image: image004.png]

Only the tab names are shown; no graph is drawn, and no metrics are requested or rendered. What do you think about me creating a Jira bug for this issue?

  2.  The second issue is the reason why we observe this behavior. After some profiling in JVisualVM and JMC, it looks like the hot spot for us is adding metrics into the HashMap.



In the tested setup we had 60 TaskManagers, and each TaskManager reports 6114 metrics (operators × metrics per operator). This results in 366,840 inserts per 10 seconds, i.e. roughly 36k inserts per second. The problem is that with a small refresh interval, the many requests from the UI effectively DDoS the back-end system in our case.
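
The arithmetic behind these numbers can be checked with a few lines of Java. The class and method names are made up for this illustration, and the 60-second interval in the last line is only an assumed alternative value, not something we measured.

```java
/** Back-of-the-envelope insert rate for the numbers quoted above (illustrative helper). */
final class MetricInsertRate {
    /** Inserts per refresh = task managers x metrics per task manager. */
    static long insertsPerRefresh(int taskManagers, int metricsPerTaskManager) {
        return (long) taskManagers * metricsPerTaskManager;
    }

    /** Average inserts per second when one refresh happens every intervalSeconds. */
    static double insertsPerSecond(long insertsPerRefresh, int intervalSeconds) {
        return (double) insertsPerRefresh / intervalSeconds;
    }

    public static void main(String[] args) {
        long perRefresh = insertsPerRefresh(60, 6114);
        System.out.println(perRefresh + " inserts per refresh");
        System.out.println(insertsPerSecond(perRefresh, 10) + " inserts/s at a 10 s interval");
        // Assumed alternative interval, only to show how the rate scales.
        System.out.println(insertsPerSecond(perRefresh, 60) + " inserts/s at a 60 s interval");
    }
}
```

The rate scales inversely with the interval, which is consistent with a longer fetch interval relieving the back end.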



If you think it is interesting, I can share profiler snapshots with you. The most interesting part is the hot methods:



Stack Trace                                                                                           Sample Count   Percentage(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump)   709            79.395
java.util.concurrent.ConcurrentHashMap.putVal(Object, Object, boolean)                                595            66.629
sun.misc.FloatingDecimal.toJavaFormatString(double)                                                   89             9.966



A lot of CPU is also spent in young-generation (New) GC, again caused by the addMetric method:

Stack Trace                                                                                           TLABs   Total TLAB Size(bytes)   Pressure(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump)   2,537   1,791,614,312            61.372



Increasing the interval in MetricFetcher#update by recompiling Flink improves UI responsiveness.

We are also using the G1 garbage collector for our JobManager, which has 8 GB of heap. We have noticed that young GC takes a very significant amount of time, especially during the Scan RS phase. Is there any recommendation from the community about which GC algorithm we should use for the JobManager (and TaskManager)?
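
For comparing collectors, the standard java.lang.management API reports which collectors a JVM runs and how much time each has accumulated. The sketch below is a generic example with a made-up class name (GcReport); it inspects only the JVM it runs in, so for a JobManager it would have to run inside that process or be replaced by the equivalent JMX beans.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Lists the garbage collectors of the current JVM with their counters (illustrative helper). */
final class GcReport {
    /** One line per collector: name, number of collections, accumulated collection time in ms. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Sampling these counters before and after a UI refresh gives a rough idea of how much collection time a single refresh costs.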



Thank you

Kind Regards

Oleksandr



From: Till Rohrmann <tr...@apache.org>
Date: Wednesday 2 January 2019 at 14:34
To: Oleksandr Nitavskyi <o....@criteo.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>, "dwysakowicz@apache.org" <dw...@apache.org>, Jeff Bean <je...@data-artisans.com>, Jérôme Viveret <j....@criteo.com>, Juan Gentile <j....@criteo.com>
Subject: Re: 1.6 UI issues



Hi Oleksandr,



the requestJob call should only take longer if either the `JobMaster` is overloaded and too busy to respond to the request or if the ArchivedExecutionGraph is very large (e.g. very large accumulators) and generating it and sending it over to the RestServerEndpoint takes too long. This is also the change which was introduced with Flink 1.5. Instead of simply handing over a reference to the RestServerEndpoint from the JobMaster, the ArchivedExecutionGraph now needs to be sent through the network stack to the RestServerEndpoint.



If you did not change the akka.framesize then the maximum size of the ArchivedExecutionGraph should only be 10 MB, though. Therefore, I would guess that your `JobMaster` must be quite busy if the requests time out.
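
As a quick sanity check of the 10 MB figure: the default akka.framesize quoted further down in this thread is '10485760b'. The Java sketch below converts that value and compares a payload size against it. The class FrameSize, the parsing of the "b" suffix and the less-than-or-equal comparison are assumptions made for this illustration, not how Akka or Flink evaluate the setting.

```java
/** Illustrative helper for reasoning about a frame size given as "<number>b". */
final class FrameSize {
    /** Parses a byte count written as "<number>b", e.g. "10485760b". */
    static long parseBytes(String value) {
        String trimmed = value.trim();
        if (!trimmed.endsWith("b")) {
            throw new IllegalArgumentException("expected a value like 10485760b, got: " + value);
        }
        return Long.parseLong(trimmed.substring(0, trimmed.length() - 1));
    }

    /** Assumed rule for this sketch: a payload fits if it is not larger than the frame size. */
    static boolean fits(long payloadBytes, String frameSize) {
        return payloadBytes <= parseBytes(frameSize);
    }

    public static void main(String[] args) {
        long bytes = parseBytes("10485760b");
        System.out.println(bytes / (1024 * 1024) + " MiB");
        System.out.println("11 MiB payload fits: " + fits(11L * 1024 * 1024, "10485760b"));
    }
}
```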



Cheers,

Till



On Wed, Jan 2, 2019 at 10:58 AM Oleksandr Nitavskyi <o....@criteo.com> wrote:

Hello guys. Happy new year!



Context: we started having trouble with the UI after bumping our Flink version from 1.4 to 1.6.3. The UI couldn’t render the job details page, so inspecting jobs became impossible for us with the new version.

It looks like we have a workaround for our UI issue.

After some investigation we realized that, starting with Flink 1.5, we get a timeout on the actor call restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. So we increased the web.timeout parameter, and the timeout exceptions on the JobManager side stopped.



Also, in SingleJobController on the AngularJS side we needed to tweak web.refresh-interval to ensure that the front end waits for the back-end request to finish. Otherwise the AngularJS side can issue another request in SingleJobController, and, for a reason we do not yet understand, the UI is not updated when the older request finishes. We will take a closer look at this behavior.
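
The idea of waiting for the previous request before sending the next one can be sketched as follows. This is a Java illustration of the pattern only, not the AngularJS code of the Flink UI; NonOverlappingPoller and its methods are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical poller that skips a refresh tick while the previous request is still running. */
final class NonOverlappingPoller<T> {
    private final Supplier<CompletableFuture<T>> request;
    private final AtomicBoolean inFlight = new AtomicBoolean(false);

    NonOverlappingPoller(Supplier<CompletableFuture<T>> request) {
        this.request = request;
    }

    /** Called on every refresh tick; returns false if the tick was skipped. */
    boolean tick(Consumer<T> render) {
        if (!inFlight.compareAndSet(false, true)) {
            return false; // the previous request has not finished yet
        }
        CompletableFuture<T> future;
        try {
            future = request.get();
        } catch (RuntimeException e) {
            inFlight.set(false); // do not block later ticks if the request could not be sent
            throw e;
        }
        future.whenComplete((value, error) -> {
            inFlight.set(false);
            if (error == null) {
                render.accept(value);
            }
        });
        return true;
    }

    public static void main(String[] args) {
        CompletableFuture<String> slow = new CompletableFuture<>();
        NonOverlappingPoller<String> poller = new NonOverlappingPoller<>(() -> slow);
        System.out.println("tick 1 sent: " + poller.tick(System.out::println));
        System.out.println("tick 2 sent: " + poller.tick(System.out::println));
        slow.complete("job details");
    }
}
```

With such a guard the refresh interval becomes a lower bound between requests rather than a fixed schedule, so a slow back end is never hit by overlapping requests for the same job.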



Does this ring a bell for you?



Thank you



Kind Regards

Oleksandr



From: Till Rohrmann <tr...@apache.org>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <j....@criteo.com>
Cc: "dwysakowicz@apache.org" <dw...@apache.org>, Jeff Bean <je...@data-artisans.com>, Oleksandr Nitavskyi <o....@criteo.com>
Subject: Re: 1.6 UI issues



Hi Juan,



thanks for the log. The log file does not contain anything suspicious. Are you sure that you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.



Cheers,

Till



On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <j....@criteo.com> wrote:

Hello Till, Dawid

Sorry for the late response on this issue and thank you Jeff for helping us with this.

Yes we are using 1.6.2

I attach the logs from the Job Master.

Also we noticed this exception:

2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.

java.util.concurrent.CancellationException

    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)

    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)

    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)

    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)

    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)

    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)

    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)

    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

    at java.lang.Thread.run(Thread.java:748)

2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Implementation error: Unhandled exception.

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

    at java.lang.Thread.run(Thread.java:748)



Because of this we tested with the parameter -Dakka.ask.timeout=60s, but the issue remains.



Thank you

Juan



From: Till Rohrmann <tr...@apache.org>
Date: Thursday, 8 November 2018 at 16:06
To: "dwysakowicz@apache.org" <dw...@apache.org>
Cc: Juan Gentile <j....@criteo.com>, "myasuka@live.com" <my...@live.com>, user <us...@flink.apache.org>
Subject: Re: 1.6 UI issues



Hi Juan,



could you share the cluster entrypoint logs with us? They should contain more information about the internal server error.



Just to make sure, you are using Flink 1.6.2, right?



Cheers,

Till



On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <dw...@apache.org> wrote:

Hi Juan,

It doesn't look similar to the issue linked to me. What cluster setup are you using? Are you running HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid

On 02/11/2018 17:26, Juan Gentile wrote:

Hello Yun,



We haven’t seen the error you mentioned in the log. We also checked the GC and it seems to be okay. Inspecting the UI, we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}



We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html), but we are not sure.



Have you encountered this issue before?



Thank you,



From: Yun Tang <my...@live.com>
Date: Thursday, 1 November 2018 at 12:31
To: Juan Gentile <j....@criteo.com>, "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: 1.6 UI issues



Hi Juan



From our experience, you could first check the jobmanager.log to see whether it contains log lines similar to the one below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such log lines, you should increase akka.framesize to a larger value (the default value is '10485760b') [1].

Otherwise, you could check the GC log of the JobManager to see whether the GC overhead is too heavy for it; if so, consider increasing the JobManager's memory.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka


Best

Yun Tang




Re: 1.6 UI issues

Posted by Till Rohrmann <tr...@apache.org>.

Hi Oleksandr,

thanks a lot for the kind wishes and for the detailed investigation.

1. I think if the cluster cannot serve the information within the
web.refresh-interval, it would be best to increase it. I quickly looked
into the `ExecutionGraphCache`, which is used for storing the
`ArchivedExecutionGraph`, and it looks like one could change the logic a
bit. What we do at the moment is to invalidate the ExecutionGraph cache
entries after the web.refresh-interval and request an update from the
cluster. This has the benefit (given that the response is fast) that we see
the updated state sooner. Instead, one could also invalidate the old
ExecutionGraph cache entry only after the response for the new request has
arrived. This would prevent your situation because you would keep the old
state as long as the request is in flight. The downside of this approach
would be that you might wait another UI refresh interval until you see the
results, even if the response is very fast. For that you could open a JIRA
issue to discuss it further.

2. The high load caused by the MetricStore is indeed a problem. For that we
should also open a JIRA issue to investigate what we could improve here.
One thing we should definitely do is to make the fetching interval
configurable so that one doesn't have to recompile Flink in order to change
it. I actually quickly added it [1,2].
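
A configurable fetching interval as described in point 2 could look roughly like the Java sketch below: the fetch runs only if the configured interval has elapsed since the last fetch. This is an illustration and not the code from [1]; ThrottledFetcher is a hypothetical name, and the constructor argument stands in for whatever configuration option the actual change introduces.

```java
/** Hypothetical fetcher that runs the given fetch action at most once per interval. */
final class ThrottledFetcher {
    private final long updateIntervalMillis; // would come from the configuration
    private final Runnable fetch;
    private long lastUpdate;
    private boolean fetchedOnce;

    ThrottledFetcher(long updateIntervalMillis, Runnable fetch) {
        this.updateIntervalMillis = updateIntervalMillis;
        this.fetch = fetch;
    }

    /** Runs the fetch if the interval has elapsed since the last one; returns whether it ran. */
    synchronized boolean update(long nowMillis) {
        if (fetchedOnce && nowMillis - lastUpdate < updateIntervalMillis) {
            return false; // too early, keep the metrics fetched last time
        }
        fetchedOnce = true;
        lastUpdate = nowMillis;
        fetch.run();
        return true;
    }

    public static void main(String[] args) {
        ThrottledFetcher fetcher = new ThrottledFetcher(10_000, () -> System.out.println("fetching metrics"));
        System.out.println("t=0s ran: " + fetcher.update(0));
        System.out.println("t=5s ran: " + fetcher.update(5_000));
        System.out.println("t=10s ran: " + fetcher.update(10_000));
    }
}
```

With the interval taken from the configuration, a cluster with many TaskManagers can trade metric freshness for lower load without recompiling.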

Thanks a lot for your help with debugging the problems!

[1] https://github.com/apache/flink/pull/7459
[2] https://issues.apache.org/jira/browse/FLINK-11300

Cheers,
Till

Re: 1.6 UI issues

Posted by Oleksandr Nitavskyi <o....@criteo.com>.

Hello Till,

First, congratulations to you and the whole Flink community! It is great to see such success and recognition of Apache Flink and your work.

Thanks also for the previous answer and the good tips. On our side we have made further progress in understanding the issue.

So I think we have two related problems in Flink, which can be reproduced in our setup:

  1.  UI issue

It looks like there are some routing problems on the Angular side of the Flink UI. Angular refreshes the job state (which is 20 KB in our case) every 10 sec by default (web.refresh-interval).



If one of the refresh calls takes longer than web.refresh-interval, the next request is sent before the previous one has completed.



After a while the first requests start to complete, but the UI is not rendered correctly in this case.



Only the tab names are shown; no graph is displayed, and no metrics are requested or rendered. Would it be OK if I created a Jira bug for this issue?
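
To illustrate the overlap described above: one way to avoid piling up requests is to skip a refresh tick while the previous request is still in flight. The Java below is only a minimal sketch of that idea. It is not the Flink UI code (which is AngularJS), and NonOverlappingPoller and tick are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical helper: lets at most one refresh request be in flight at a time.
final class NonOverlappingPoller {
    private final AtomicBoolean inFlight = new AtomicBoolean(false);

    /**
     * Starts a request unless the previous one is still running.
     * Returns true if a request was started, false if this tick was skipped.
     */
    public boolean tick(Supplier<CompletableFuture<?>> request) {
        if (!inFlight.compareAndSet(false, true)) {
            return false; // previous request has not completed yet
        }
        CompletableFuture<?> future;
        try {
            future = request.get();
        } catch (RuntimeException e) {
            inFlight.set(false); // do not stay blocked if the request could not be started
            throw e;
        }
        // Allow the next tick once the request finishes, successfully or not.
        future.whenComplete((result, error) -> inFlight.set(false));
        return true;
    }
}
```

With such a guard, a slow response delays the next refresh instead of stacking several concurrent requests on the back end.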



  2.  The second issue is the reason why we observe this behavior. After some profiling in JVisualVM and JMC, it looks like the hot spot for us is adding metrics to the HashMap.

In the tested setup we had 60 TaskManagers, and for every TaskManager we get 6114 metrics (operators * metrics per operator), which results in 366840 inserts per 10 seconds, i.e. about 36k inserts per second. The problem is that with a small refresh interval the many requests from the UI effectively DDoS the back-end system in our case.
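
The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are hypothetical, and the 10-second interval is the value quoted above.

```java
// Hypothetical helper that reproduces the insert-rate estimate from the message.
final class MetricInsertRate {
    // Total metric inserts triggered by one fetch across all TaskManagers.
    public static long insertsPerFetch(int taskManagers, int metricsPerTaskManager) {
        return (long) taskManagers * metricsPerTaskManager;
    }

    // Average inserts per second when one fetch runs per interval.
    public static double insertsPerSecond(long insertsPerFetch, double intervalSeconds) {
        return insertsPerFetch / intervalSeconds;
    }

    public static void main(String[] args) {
        long perFetch = insertsPerFetch(60, 6114);
        System.out.println(perFetch);                       // 366840
        System.out.println(insertsPerSecond(perFetch, 10)); // 36684.0
    }
}
```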

If you think it is interesting, I can share the profiler snapshots with you. The most interesting part is the hot methods:

Stack Trace                                                                                           Sample Count   Percentage(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump)   709            79.395
java.util.concurrent.ConcurrentHashMap.putVal(Object, Object, boolean)                                595            66.629
sun.misc.FloatingDecimal.toJavaFormatString(double)                                                   89             9.966

Also, a lot of CPU is spent in young-generation (New) GC, again caused by the addMetric method:
Stack Trace                                                                                           TLABs   Total TLAB Size(bytes)   Pressure(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump)   2,537   1,791,614,312            61.372

Increasing the interval in MetricFetcher#update by recompiling Flink improves UI responsiveness.
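
The effect of a longer interval can be sketched as a simple time-based throttle: an update is skipped unless enough time has passed since the last one. The Java below is an illustration only, not the Flink MetricFetcher source; ThrottledUpdater, its constructor parameters, and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical throttle: runs the expensive fetch at most once per interval.
final class ThrottledUpdater {
    private final long minIntervalMillis;
    private final LongSupplier clock;   // e.g. System::currentTimeMillis
    private final Runnable fetch;       // the expensive metric fetch
    private long lastUpdateMillis;
    private boolean hasRun;

    public ThrottledUpdater(long minIntervalMillis, LongSupplier clock, Runnable fetch) {
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
        this.fetch = fetch;
    }

    /** Runs the fetch only if the interval has elapsed; returns whether it ran. */
    public synchronized boolean update() {
        long now = clock.getAsLong();
        if (hasRun && now - lastUpdateMillis < minIntervalMillis) {
            return false; // too soon: skip this update and keep the old data
        }
        hasRun = true;
        lastUpdateMillis = now;
        fetch.run();
        return true;
    }
}
```

A larger interval means fewer fetches per UI request and therefore fewer inserts and less allocation, at the cost of staler metrics.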

Also, we are using the G1 garbage collector for our JobManager, which has an 8 GB heap. We have noticed that young GC takes a very significant amount of time, especially during the Scan RS phase. Is there any recommendation from the community about which GC algorithm we should use for the JobManager (and TaskManager)?

Thank you
Kind Regards
Oleksandr

From: Till Rohrmann <tr...@apache.org>
Date: Wednesday 2 January 2019 at 14:34
To: Oleksandr Nitavskyi <o....@criteo.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>, "dwysakowicz@apache.org" <dw...@apache.org>, Jeff Bean <je...@data-artisans.com>, Jérôme Viveret <j....@criteo.com>, Juan Gentile <j....@criteo.com>
Subject: Re: 1.6 UI issues

Hi Oleksandr,

the requestJob call should only take longer if either the `JobMaster` is overloaded and too busy to respond to the request or if the ArchivedExecutionGraph is very large (e.g. very large accumulators) and generating it and sending it over to the RestServerEndpoint takes too long. This is also the change which was introduced with Flink 1.5. Instead of simply handing over a reference to the RestServerEndpoint from the JobMaster, the ArchivedExecutionGraph now needs to be sent through the network stack to the RestServerEndpoint.

If you did not change the akka.framesize then the maximum size of the ArchivedExecutionGraph should only be 10 MB, though. Therefore, I would guess that your `JobMaster` must be quite busy if the requests time out.

Cheers,
Till

On Wed, Jan 2, 2019 at 10:58 AM Oleksandr Nitavskyi <o....@criteo.com>> wrote:
Hello guys. Happy new year!

Context: we started to have some troubles with UI after bumping our Flink version from 1.4 to 1.6.3. UI couldn’t render Job details page, so inspecting of the jobs for us has become impossible with the new version.

And looks like we have a workaround for our UI issue.
After some investigation we realized that starting from Flink 1.5 version we started to have a timeout on the actor call: restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. So we have increased web.timeout parameter and we have stopped to have timeout exception on the JobManager side.

Also in SingleJobController on the Angular JS side we needed to tweak web.refresh-interval in order to ensure that Front-End is waiting for back-end request to be finished. Otherwise Angular JS side can make another request in SingleJobController and don’t know why when older request is finished no UI has been changed. We will have a look closer on this behavior.

Does it ring a bell for you probably?

Thank you

Kind Regards
Oleksandr

From: Till Rohrmann <tr...@apache.org>>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <j....@criteo.com>>
Cc: "dwysakowicz@apache.org<ma...@apache.org>" <dw...@apache.org>>, Jeff Bean <je...@data-artisans.com>>, Oleksandr Nitavskyi <o....@criteo.com>>
Subject: Re: 1.6 UI issues

Hi Juan,

thanks for the log. The log file does not contain anything suspicious. Are you sure that you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.

Cheers,
Till

On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <j....@criteo.com>> wrote:

Hello Till, Dawid
Sorry for the late response on this issue and thank you Jeff for helping us with this.
Yes we are using 1.6.2
I attach the logs from the Job Master.
Also we noticed this exception:
2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.
java.util.concurrent.CancellationException
    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)
    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)
    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)
    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)
    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)
    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

For which we tested with this parameter: -Dakka.ask.timeout=60s
But the issue remains.

Thank you
Juan

From: Till Rohrmann <tr...@apache.org>>
Date: Thursday, 8 November 2018 at 16:06
To: "dwysakowicz@apache.org<ma...@apache.org>" <dw...@apache.org>>
Cc: Juan Gentile <j....@criteo.com>>, "myasuka@live.com<ma...@live.com>" <my...@live.com>>, user <us...@flink.apache.org>>
Subject: Re: 1.6 UI issues

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain more information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,
Till

On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <dw...@apache.org>> wrote:

Hi Juan,

It doesn't look similar to the issue linked to me. What cluster setup are you using? Are you running HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid
On 02/11/2018 17:26, Juan Gentile wrote:
Hello Yun,

We haven’t seen the error in the log as you mentioned. We also checked the GC and it seems to be okay. Inspecting the UI we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}

We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html) but we are not so sure.

Have you encountered this issue before?

Thank you,

From: Yun Tang <my...@live.com>
Date: Thursday, 1 November 2018 at 12:31
To: Juan Gentile <j....@criteo.com>, "user@flink.apache.org"<ma...@flink.apache.org> <us...@flink.apache.org>
Subject: Re: 1.6 UI issues

Hi Juan

From our experience, you could check the jobmanager.log first to see whether existing similar logs below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes



If you see these logs, you should increase the akka.framesize to larger value (default value is '10485760b') [1].



Otherwise, you could check the gc-log of job manager to see whether the gc overhead is too heavy for your job manager, consider to increase the memory for your job manager if so.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best
Yun Tang

________________________________
From: Juan Gentile <j....@criteo.com>
Sent: Wednesday, October 31, 2018 22:05
To: user@flink.apache.org<ma...@flink.apache.org>
Subject: 1.6 UI issues


Hello!



We are migrating the the last 1.6 version and all the jobs seem to work fine, but when we check individual jobs through the web interface we encounter the issue that after clicking on a job, either it takes too long to load the information of the job or it never loads at all.



Has anyone had this issue? Any clues as to why?



Thank you,

Juan

Re: 1.6 UI issues

Posted by Till Rohrmann <tr...@apache.org>.

Hi Oleksandr,

the requestJob call should only take longer if either the `JobMaster` is
overloaded and too busy to respond to the request or if the
ArchivedExecutionGraph is very large (e.g. very large accumulators) and
generating it and sending it over to the RestServerEndpoint takes too long.
This is also the change which was introduced with Flink 1.5. Instead of
simply handing over a reference to the RestServerEndpoint from the
JobMaster, the ArchivedExecutionGraph now needs to be sent through the
network stack to the RestServerEndpoint.

If you did not change the akka.framesize then the maximum size of the
ArchivedExecutionGraph should only be 10 MB, though. Therefore, I would
guess that your `JobMaster` must be quite busy if the requests time out.

Cheers,
Till

On Wed, Jan 2, 2019 at 10:58 AM Oleksandr Nitavskyi <o....@criteo.com>
wrote:

> Hello guys. Happy new year!
>
>
>
> Context: we started to have some troubles with UI after bumping our Flink
> version from 1.4 to 1.6.3. UI couldn’t render Job details page, so
> inspecting of the jobs for us has become impossible with the new version.
>
>
>
> And looks like we have a workaround for our UI issue.
>
> After some investigation we realized that starting from Flink 1.5 version
> we started to have a timeout on the actor call: *restfulGateway.requestJob(jobId,
> timeout)* in *ExecutionGraphCache*. So we have increased *web.timeout*
> parameter and we have stopped to have timeout exception on the JobManager
> side.
>
>
>
> Also in *SingleJobController* on the Angular JS side we needed to tweak
> *web.refresh-interval* in order to ensure that Front-End is waiting for
> back-end request to be finished. Otherwise Angular JS side can make another
> request in SingleJobController and don’t know why when older request is
> finished no UI has been changed. We will have a look closer on this
> behavior.
>
>
>
> Does it ring a bell for you probably?
>
>
>
> Thank you
>
>
>
> Kind Regards
>
> Oleksandr
>
>
>
> *From: *Till Rohrmann <tr...@apache.org>
> *Date: *Wednesday 19 December 2018 at 16:52
> *To: *Juan Gentile <j....@criteo.com>
> *Cc: *"dwysakowicz@apache.org" <dw...@apache.org>, Jeff Bean <
> jeff@data-artisans.com>, Oleksandr Nitavskyi <o....@criteo.com>
> *Subject: *Re: 1.6 UI issues
>
>
>
> Hi Juan,
>
>
>
> thanks for the log. The log file does not contain anything suspicious. Are
> you sure that you sent me the right file? The timestamps don't seem to
> match. In the attached log, the job seems to run without problems.
>
>
>
> Cheers,
>
> Till
>
>
>
> On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <j....@criteo.com>
> wrote:
>
> Hello Till, Dawid
>
> Sorry for the late response on this issue and thank you Jeff for helping
> us with this.
>
> Yes we are using 1.6.2
>
> I attach the logs from the Job Master.
>
> Also we noticed this exception:
>
> 2018-12-19 08:50:10,497 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   -
> Implementation error: Unhandled exception.
>
> java.util.concurrent.CancellationException
>
>     at
> java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)
>
>     at
> org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)
>
>     at
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)
>
>     at
> org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)
>
>     at
> org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)
>
>     at
> org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)
>
>     at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
>     at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
>     at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>
>     at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>
>     at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>
>     at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
>
>     at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>
>     at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>
>     at java.lang.Thread.run(Thread.java:748)
>
> 2018-12-19 08:50:17,977 ERROR
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  -
> Implementation error: Unhandled exception.
>
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms].
> Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>
>     at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>
>     at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>
>     at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>
>     at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>
>     at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>
>     at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>
>     at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>
>     at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>
>     at java.lang.Thread.run(Thread.java:748)
>
>
>
> For which we tested with this parameter: -Dakka.ask.timeout=60s
>
> But the issue remains.
>
>
>
> Thank you
>
> Juan
>
>
>
> *From: *Till Rohrmann <tr...@apache.org>
> *Date: *Thursday, 8 November 2018 at 16:06
> *To: *"dwysakowicz@apache.org" <dw...@apache.org>
> *Cc: *Juan Gentile <j....@criteo.com>, "myasuka@live.com" <
> myasuka@live.com>, user <us...@flink.apache.org>
> *Subject: *Re: 1.6 UI issues
>
>
>
> Hi Juan,
>
>
>
> could you share the cluster entrypoint logs with us? They should contain
> more information about the internal server error.
>
>
>
> Just to make sure, you are using Flink 1.6.2, right?
>
>
>
> Cheers,
>
> Till
>
>
>
> On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <dw...@apache.org>
> wrote:
>
> Hi Juan,
>
> It doesn't look similar to the issue linked to me. What cluster setup are
> you using? Are you running HA mode?
>
> I am adding Till to cc, who might be able to help you more.
>
> Best,
>
> Dawid
>
> On 02/11/2018 17:26, Juan Gentile wrote:
>
> Hello Yun,
>
>
>
> We haven’t seen the error in the log as you mentioned. We also checked the
> GC and it seems to be okay. Inspecting the UI we found the following error:
>
>
>
> {"errors":["Could not retrieve the redirect address of the current leader.
> Please try to refresh."]}
>
>
>
>
> We suspect we are running into the same issue as described here (
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html)
> but we are not so sure.
>
>
>
> Have you encountered this issue before?
>
>
>
> Thank you,
>
>
>
> *From: *Yun Tang <my...@live.com> <my...@live.com>
> *Date: *Thursday, 1 November 2018 at 12:31
> *To: *Juan Gentile <j....@criteo.com> <j....@criteo.com>,
> "user@flink.apache.org" <us...@flink.apache.org> <us...@flink.apache.org>
> <us...@flink.apache.org>
> *Subject: *Re: 1.6 UI issues
>
>
>
> Hi Juan
>
>
>
> From our experience, you could check the jobmanager.log first to see
> whether existing similar logs below:
>
> max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes
>
>
>
> If you see these logs, you should increase the akka.framesize to larger value (default value is '10485760b') [1].
>
>
>
> Otherwise, you could check the gc-log of job manager to see whether the gc overhead is too heavy for your job manager, consider to increase the memory for your job manager if so.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka
>
> Best
>
> Yun Tang
>
>
> ------------------------------
>
> *From:* Juan Gentile <j....@criteo.com> <j....@criteo.com>
> *Sent:* Wednesday, October 31, 2018 22:05
> *To:* user@flink.apache.org
> *Subject:* 1.6 UI issues
>
>
>
> Hello!
>
>
>
> We are migrating the the last 1.6 version and all the jobs seem to work
> fine, but when we check individual jobs through the web interface we
> encounter the issue that after clicking on a job, either it takes too long
> to load the information of the job or it never loads at all.
>
>
>
> Has anyone had this issue? Any clues as to why?
>
>
>
> Thank you,
>
> Juan
>
>

Re: 1.6 UI issues

Posted by Oleksandr Nitavskyi <o....@criteo.com>.

Hello guys. Happy new year!

Context: we started to have some trouble with the UI after bumping our Flink version from 1.4 to 1.6.3. The UI couldn’t render the Job details page, so inspecting jobs has become impossible for us with the new version.

It looks like we have a workaround for our UI issue.
After some investigation we realized that, starting from Flink 1.5, we get a timeout on the actor call restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. So we increased the web.timeout parameter, and the timeout exceptions on the JobManager side stopped.

Also, in SingleJobController on the AngularJS side we needed to tweak web.refresh-interval to ensure that the front end waits for the back-end request to finish. Otherwise the AngularJS side can issue another request in SingleJobController, and, for a reason we do not yet understand, the UI is not updated when the older request finishes. We will take a closer look at this behavior.

Does this ring a bell for you?

Thank you

Kind Regards
Oleksandr

From: Till Rohrmann <tr...@apache.org>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <j....@criteo.com>
Cc: "dwysakowicz@apache.org" <dw...@apache.org>, Jeff Bean <je...@data-artisans.com>, Oleksandr Nitavskyi <o....@criteo.com>
Subject: Re: 1.6 UI issues

Hi Juan,

thanks for the log. The log file does not contain anything suspicious. Are you sure that you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.

Cheers,
Till

On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <j....@criteo.com>> wrote:

Hello Till, Dawid
Sorry for the late response on this issue and thank you Jeff for helping us with this.
Yes we are using 1.6.2
I attach the logs from the Job Master.
Also we noticed this exception:
2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.
java.util.concurrent.CancellationException
    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)
    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)
    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)
    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)
    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)
    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

For which we tested with this parameter: -Dakka.ask.timeout=60s
But the issue remains.

Thank you
Juan

From: Till Rohrmann <tr...@apache.org>>
Date: Thursday, 8 November 2018 at 16:06
To: "dwysakowicz@apache.org<ma...@apache.org>" <dw...@apache.org>>
Cc: Juan Gentile <j....@criteo.com>>, "myasuka@live.com<ma...@live.com>" <my...@live.com>>, user <us...@flink.apache.org>>
Subject: Re: 1.6 UI issues

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain more information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,
Till

On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <dw...@apache.org>> wrote:

Hi Juan,

It doesn't look similar to the issue linked to me. What cluster setup are you using? Are you running HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid
On 02/11/2018 17:26, Juan Gentile wrote:
Hello Yun,

We haven’t seen the error in the log as you mentioned. We also checked the GC and it seems to be okay. Inspecting the UI we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}
Error! Filename not specified.

We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_akka-2Dtimeout-2Dtd14996.html&d=DwMFaQ&c=nxfEpP1JWHVKAq835DW4mA&r=z5BFHEFwsu2ghSzcXn1_8T3-VzeesIO2aULbUy2urus&m=hidCcXGD2aiyfZuADm1v4XCzvKL2Rsww7WJWxETtxRY&s=zMBuP5aTcdQ5VMavXw1dGvz72efTyTSq6tpbFcPSHxU&e=>) but we are not so sure.

Have you encountered this issue before?

Thank you,

From: Yun Tang <my...@live.com>
Date: Thursday, 1 November 2018 at 12:31
To: Juan Gentile <j....@criteo.com>, "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: 1.6 UI issues

Hi Juan

From our experience, you could first check the jobmanager.log for log lines similar to the one below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such lines, you should increase akka.framesize to a larger value (the default is '10485760b') [1].

Otherwise, you could check the GC log of the JobManager to see whether the GC overhead is too heavy. If so, consider increasing the memory of the JobManager.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best
Yun Tang

________________________________
From: Juan Gentile <j....@criteo.com>
Sent: Wednesday, October 31, 2018 22:05
To: user@flink.apache.org
Subject: 1.6 UI issues


Hello!



We are migrating to the latest 1.6 version and all the jobs seem to work fine, but when we check individual jobs through the web interface, we find that after clicking on a job, the job information either takes too long to load or never loads at all.



Has anyone had this issue? Any clues as to why?



Thank you,

Juan

Re: 1.6 UI issues

Posted by Till Rohrmann <tr...@apache.org>.

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain
more information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,
Till


Re: 1.6 UI issues

Posted by Dawid Wysakowicz <dw...@apache.org>.

Hi Juan,

It doesn't look similar to the issue linked to me. What cluster setup
are you using? Are you running HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid


Re: 1.6 UI issues

Posted by Juan Gentile <j....@criteo.com>.

Hello Yun,

We haven’t seen the error you mentioned in the log. We also checked the GC and it seems to be okay. Inspecting the UI, we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}

We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html), but we are not sure.

Have you encountered this issue before?

Thank you,


Re: 1.6 UI issues

Posted by Yun Tang <my...@live.com>.

Hi Juan

From our experience, you could first check the jobmanager.log for log lines similar to the one below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such lines, you should increase akka.framesize to a larger value (the default is '10485760b') [1].

Otherwise, you could check the GC log of the JobManager to see whether the GC overhead is too heavy. If so, consider increasing the memory of the JobManager.
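
The log check described above could be automated roughly as follows. This is a minimal sketch that assumes the log lines match the wording quoted above; the exact message may differ between Akka and Flink versions. The function names and the jobmanager.log path in the usage comment are illustrative and not part of Flink.

```python
import re

# Pattern derived from the log line quoted above. The exact wording is an
# assumption and may need adjusting for your Akka/Flink version.
OVERSIZED = re.compile(
    r"max allowed size (\d+) bytes, "
    r"actual size of encoded (.+?) was (\d+) bytes"
)


def find_oversized_messages(lines):
    """Yield (max_allowed, encoded_type, actual_size) for each matching line."""
    for line in lines:
        match = OVERSIZED.search(line)
        if match:
            yield int(match.group(1)), match.group(2), int(match.group(3))


def largest_actual_size(hits):
    """Return the largest reported message size in bytes, or None if no hits."""
    return max((actual for _, _, actual in hits), default=None)


# Usage (hypothetical path, replace with your JobManager log file):
#   with open("jobmanager.log") as log_file:
#       hits = list(find_oversized_messages(log_file))
#   print(len(hits), "oversized messages, largest:", largest_actual_size(hits))
```

If any hits are reported, the largest actual size gives a lower bound for the akka.framesize value to configure.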


[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best
Yun Tang
