Posted to user@flink.apache.org by "Martin, Nick J [US] (IS)" <Ni...@ngc.com> on 2019/10/15 23:15:03 UTC

Jar Uploads in High Availability (Flink 1.7.2)

I'm seeing that when I upload a jar through the REST API, only the JobManager that received the upload request appears to be aware of the newly uploaded jar. That worked fine for me in older versions, where all clients were redirected to connect to the leader. Now that each JobManager accepts requests, a jar upload request can end up on any one (and only one) of the JobManagers, not necessarily the leader. Further, each JobManager responds to a GET request on the /jars endpoint with its own local list of jars. If I try to use one of the jar IDs from that response, my next request may not go to the same JobManager (requests are going through Docker and being load-balanced), and so the jar ID isn't found on the JobManager handling that request.
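
The failure mode described above can be reproduced with a small in-memory model. This is only an illustrative sketch, not Flink code: the `JobManager` and `RoundRobinBalancer` classes, the instance names, and the jar ID format are all made up for the example. It assumes a balancer that simply alternates between two instances, each keeping its own jar list.

```python
import itertools
import uuid


class JobManager:
    """Toy stand-in for one JobManager's REST endpoint with a local jar store."""

    def __init__(self, name):
        self.name = name
        self.jars = {}  # jar ID -> jar bytes, visible to this instance only

    def upload(self, filename, data):
        # The ID format here is invented for the example.
        jar_id = f"{uuid.uuid4()}_{filename}"
        self.jars[jar_id] = data
        return jar_id

    def list_jars(self):
        return sorted(self.jars)

    def run(self, jar_id):
        if jar_id not in self.jars:
            return 404, f"{self.name}: jar {jar_id} not found"
        return 200, f"{self.name}: started job from {jar_id}"


class RoundRobinBalancer:
    """Hypothetical load balancer that rotates through the instances."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        return next(self._cycle)


jm_a, jm_b = JobManager("jm-a"), JobManager("jm-b")
balancer = RoundRobinBalancer([jm_a, jm_b])

# Request 1 (upload) lands on jm-a, request 2 (run) lands on jm-b.
jar_id = balancer.next_backend().upload("job.jar", b"fake jar bytes")
status, message = balancer.next_backend().run(jar_id)
print(status, message)
```

The second request fails because only the instance that handled the upload knows the jar ID.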

RE: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Posted by "Martin, Nick J [US] (IS)" <Ni...@ngc.com>.

So I think what you’re saying is if I use a DFS for web.upload.dir, my clients can send all their requests to any Job Manager instance and not worry or care which one is the leader. That definitely is an improvement, thanks.


Re: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Posted by Zili Chen <wa...@gmail.com>.

FYI, there is already a corresponding issue for this:
https://issues.apache.org/jira/browse/FLINK-13660

Best,
tison.



Re: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Posted by Till Rohrmann <tr...@apache.org>.

Hi Martin,

Flink's web UI-based job submission is not well suited to running behind a
load balancer at the moment. The problem is that web-based job submission
is actually a two-phase operation: uploading the jars and then starting the
job. Since Flink's RestServer stores the uploaded files locally, the job
must be started through the same RestServer to which you uploaded the
files. Note, however, that job submission through the CLI client is not
affected by this, since the job graph upload and the submission happen in a
single request.
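
As a rough illustration of the two phases, here is a sketch of a client that sends both requests to one explicitly chosen RestServer instead of going through the load balancer. The base URL is a placeholder. The `jarfile` form field, the `filename` and `jobid` response fields, and the rule that the jar ID is the last path segment of `filename` reflect my reading of the Flink REST API documentation and should be checked against the version in use.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid


def upload_url(base_url):
    return base_url.rstrip("/") + "/jars/upload"


def run_url(base_url, jar_id):
    return base_url.rstrip("/") + "/jars/" + urllib.parse.quote(jar_id) + "/run"


def jar_id_from_upload_response(body):
    # Assumption: the upload response carries the stored path in "filename"
    # and the jar ID is the last path segment.
    return body["filename"].rsplit("/", 1)[-1]


def build_multipart(field, filename, payload):
    """Encode a single file as multipart/form-data; returns (body, content type)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/x-java-archive\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def submit(base_url, jar_path):
    """Phase 1: upload the jar. Phase 2: run it on the SAME RestServer."""
    with open(jar_path, "rb") as fh:
        payload = fh.read()
    body, content_type = build_multipart(
        "jarfile", os.path.basename(jar_path), payload)
    upload = urllib.request.Request(
        upload_url(base_url), data=body,
        headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(upload) as resp:
        jar_id = jar_id_from_upload_response(json.load(resp))
    run = urllib.request.Request(
        run_url(base_url, jar_id), data=b"{}",
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(run) as resp:
        return json.load(resp)["jobid"]
```

Calling `submit("http://jm-a:8081", "job.jar")` (hypothetical host) keeps both phases on one instance, which is the property a load balancer does not guarantee.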

A workaround to make the uploads accessible to all RestServers is to
configure a DFS for the `web.upload.dir` as Ravi suggested or to use
Flink's CLI to submit jobs instead.
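
To show why a shared `web.upload.dir` helps, here is a toy model in which two servers resolve jar IDs against the same directory. It is a sketch under assumptions: the `RestServer` class is invented for the example, and a temporary directory stands in for the DFS mount that each real instance would be configured to use.

```python
import os
import tempfile
import uuid


class RestServer:
    """Toy stand-in for a RestServer that keeps uploads in a directory."""

    def __init__(self, name, upload_dir):
        self.name = name
        self.upload_dir = upload_dir

    def upload(self, filename, data):
        # The ID format here is invented for the example.
        jar_id = f"{uuid.uuid4()}_{filename}"
        with open(os.path.join(self.upload_dir, jar_id), "wb") as fh:
            fh.write(data)
        return jar_id

    def has_jar(self, jar_id):
        return os.path.isfile(os.path.join(self.upload_dir, jar_id))


with tempfile.TemporaryDirectory() as shared_dir:
    # shared_dir plays the role of a DFS location every instance can reach.
    server_a = RestServer("rest-a", shared_dir)
    server_b = RestServer("rest-b", shared_dir)
    jar_id = server_a.upload("job.jar", b"fake jar bytes")
    found_on_other = server_b.has_jar(jar_id)

print(found_on_other)
```

Because both instances look in the same place, it no longer matters which one received the upload.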

A quick note about the old behaviour with the redirects: the redirects
actually defeated the purpose of a load balancer, because all requests were
redirected to a single RestServer instance. Hence, running with or without
a load balancer should not have made a big difference.

Cheers,
Till


RE: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Posted by "Martin, Nick J [US] (IS)" <Ni...@ngc.com>.

Yeah, I’ll do that if I have to. I’m hoping there’s a ‘right’ way to do it that’s easier. If I have to implement the ZooKeeper lookups in my load balancer myself, that feels like a definite step backwards from the pre-1.5 days, when the cluster would give 307 redirects to the current leader.


Re: Jar Uploads in High Availability (Flink 1.7.2)

Posted by Ravi Bhushan Ratnakar <ra...@gmail.com>.

Hi,

I was also experiencing similar behavior. I adopted the following
approach:

   -  I used a distributed file system (AWS EFS in my case) and set the
   "web.upload.dir" attribute to point to it, so that both JobManagers use
   the same location.
   - On the load balancer side (AWS ELB), I used a "readiness probe" based
   on the ZooKeeper entry for the active JobManager address. This way the
   ELB always points to the active JobManager, and if the active JobManager
   changes, it automatically points to the new one. Because both use the
   same location on the distributed file system, the new active JobManager
   is able to find the same jar.
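
The probe idea above could look roughly like the following. This is a sketch only, not what was actually deployed: how the leader address is read from ZooKeeper is deliberately left out, the function names are hypothetical, and the addresses are placeholders. The leader address is assumed to arrive as a plain string, or `None` when no leader is known.

```python
def _normalize(address):
    """Make addresses comparable regardless of case or a trailing slash."""
    return address.strip().rstrip("/").lower()


def probe_status(local_address, leader_address):
    """Return 200 only when this instance is the current leader, else 503."""
    if leader_address is None:
        return 503  # no leader known: report every instance as not ready
    if _normalize(local_address) == _normalize(leader_address):
        return 200
    return 503


def healthy_backends(backends, leader_address):
    """Backends the load balancer would keep in rotation."""
    return [b for b in backends if probe_status(b, leader_address) == 200]


backends = ["http://jm-a:8081", "http://jm-b:8081"]
print(healthy_backends(backends, "http://JM-B:8081/"))
print(healthy_backends(backends, None))
```

With a probe like this, only the leader passes the health check, so the load balancer effectively routes every request to it.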


Regards,
Ravi
