Posted to user@flink.apache.org by Joey Echeverria <je...@splunk.com> on 2018/08/01 19:39:41 UTC

Re: Delay in REST/UI readiness during JM recovery

Sorry to ping my own thread, but has anyone else encountered this?

-Joey

> On Jul 30, 2018, at 11:10 AM, Joey Echeverria <je...@splunk.com> wrote:
> 
> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single Job Manager running. I’m using Zookeeper to store the fencing/leader information and S3 to store the job manager state. We’ve been running around 250 or so streaming jobs and we’ve noticed that if the job manager pod is deleted, it takes something like 20-45 minutes for the job manager’s REST endpoints and web UI to become available. Until it becomes available, we get a 503 response from the HTTP server with the message "Could not retrieve the redirect address of the current leader. Please try to refresh.”.
> 
> Has anyone else run into this?
> 
> Are there any configuration settings I should be looking at to speed up the availability of the HTTP endpoints?
> 
> Thanks!
> 
> -Joey
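
[Editor's note] For reference, an HA setup like the one described above is
usually configured in flink-conf.yaml along these lines (a minimal sketch
only; the ZooKeeper quorum, S3 bucket, and cluster id are placeholders, not
values taken from this thread):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
high-availability.storageDir: s3://my-bucket/flink/ha/
high-availability.cluster-id: /my-flink-cluster
# JobManager metadata (submitted job graphs, checkpoint pointers) lives in
# storageDir; ZooKeeper keeps only small pointers plus the leader
# information. The 503 quoted above means the new leader's address has not
# been published there yet.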


Re: Delay in REST/UI readiness during JM recovery

Posted by vino yang <ya...@gmail.com>.
Hi Joey,

Currently the REST endpoints are hosted in the JobManager, so your scenario
is a JM failover while the cluster is running a large number of jobs. During
failover, ZooKeeper first needs some time to complete the leader election,
then the JM has to wait for the TaskManagers to register, and finally all of
the jobs have to be restored and restarted. This can take quite a while, and
during that period the JM can be busy enough that the web service is
unresponsive or slow to respond.
That said, 20-45 minutes is really long, so you first need to confirm what is
causing it. For example, if you reduce the number of jobs in the cluster by
half, does the web UI become available much faster?
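
[Editor's note] One way to narrow this down is to measure how long the REST
endpoint keeps returning that 503 after the JM pod is deleted. A rough sketch
of such a probe follows (not from this thread; the hostname, port, and the
/overview path are assumptions based on a default Flink REST setup):

import java.net.HttpURLConnection;
import java.net.URL;

public class JmReadinessProbe {
    public static void main(String[] args) throws Exception {
        // Poll the cluster overview endpoint until the JobManager answers.
        URL overview = new URL("http://flink-jobmanager:8081/overview");
        long start = System.currentTimeMillis();
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) overview.openConnection();
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            int status;
            try {
                status = conn.getResponseCode();
            } catch (java.io.IOException e) {
                status = -1; // pod not reachable yet
            }
            if (status == 200) {
                long secs = (System.currentTimeMillis() - start) / 1000;
                System.out.println("REST endpoint ready after " + secs + "s");
                break;
            }
            // A 503 here corresponds to the "Could not retrieve the redirect
            // address of the current leader" response seen during failover.
            System.out.println("Not ready yet (HTTP " + status + "), retrying in 10s");
            Thread.sleep(10_000);
        }
    }
}

Repeating the measurement with half of the jobs, as suggested above, would
indicate whether job recovery or leader election dominates the delay.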

Thanks, vino.

2018-08-02 3:39 GMT+08:00 Joey Echeverria <je...@splunk.com>:

> Sorry to ping my own thread, but has anyone else encountered this?
>
> -Joey
>
> > On Jul 30, 2018, at 11:10 AM, Joey Echeverria <je...@splunk.com>
> wrote:
> >
> > I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single
> Job Manager running. I’m using Zookeeper to store the fencing/leader
> information and S3 to store the job manager state. We’ve been running
> around 250 or so streaming jobs and we’ve noticed that if the job manager
> pod is deleted, it takes something like 20-45 minutes for the job manager’s
> REST endpoints and web UI to become available. Until it becomes available,
> we get a 503 response from the HTTP server with the message "Could not
> retrieve the redirect address of the current leader. Please try to
> refresh.”.
> >
> > Has anyone else run into this?
> >
> > Are there any configuration settings I should be looking at to speed up
> the availability of the HTTP endpoints?
> >
> > Thanks!
> >
> > -Joey
>
>