Posted to user@flink.apache.org by "Chan, Regina" <Re...@gs.com> on 2017/10/30 19:22:19 UTC

Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I've already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
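
For reference, the two settings described above would look roughly like this in a Flink 1.2-era conf/flink-conf.yaml (the key names exist in that release line; the values are only illustrative):

    jobmanager.heap.mb: 6144        # 6 GB JobManager heap
    akka.client.timeout: 1800 s     # 30 minute client-side job submission timeout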

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 *  (212) 902-5697


RE: Job Manager Configuration

Posted by "Newport, Billy" <Bi...@gs.com>.
The user code for all the flows is common, though, so is there an inefficiency here in copying this code for every operator?


From: Chesnay Schepler [mailto:chesnay@apache.org]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan, it just depends on the available resources.

The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you're looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task, Flink copies the user code of all flows to the executing TaskManager, which the network may just not be able to handle in time.

I suggest splitting your job into smaller batches, or even running each flow independently.
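
A rough sketch of that batching idea, assuming the DataSet API and a hypothetical buildFlow() helper that stands in for one of the 300 flows (this is illustrative code, not the original poster's):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.io.DiscardingOutputFormat;

    public class BatchedSubmission {

        private static final int TOTAL_FLOWS = 300;
        private static final int BATCH_SIZE = 10;  // small enough that each submission stays well under the timeout

        public static void main(String[] args) throws Exception {
            for (int start = 0; start < TOTAL_FLOWS; start += BATCH_SIZE) {
                // a fresh environment, and therefore a separate job, per batch
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
                int end = Math.min(start + BATCH_SIZE, TOTAL_FLOWS);
                for (int i = start; i < end; i++) {
                    buildFlow(env, i);
                }
                env.execute("flows " + start + " to " + (end - 1));
            }
        }

        // Hypothetical stand-in for one flow: 2 inputs, a transformation, 1 sink.
        private static void buildFlow(ExecutionEnvironment env, int flowIndex) {
            DataSet<String> inputA = env.fromElements("a-" + flowIndex);
            DataSet<String> inputB = env.fromElements("b-" + flowIndex);
            inputA.union(inputB)
                  .map(new ToUpperCase())
                  .output(new DiscardingOutputFormat<String>());
        }

        private static final class ToUpperCase implements MapFunction<String, String> {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        }
    }

Each env.execute() call then submits a plan containing only BATCH_SIZE flows, so the JobManager never has to deploy all 1800-plus tasks at once.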

On 31.10.2017 16:25, Chan, Regina wrote:
Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org<ma...@flink.apache.org>'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I've already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 *  (212) 902-5697




Re: Job Manager Configuration

Posted by Till Rohrmann <tr...@apache.org>.
That is the question I hope to be able to answer with the logs. Let's see
what they say.

Cheers,
Till

On Wed, Nov 8, 2017 at 7:24 PM, Chan, Regina <Re...@gs.com> wrote:

> Thanks for the responses!
>
>
>
> I’m currently using 1.2.0 – going to bump it up once I have things
> stabilized. I haven’t defined any slot sharing groups but I do think that
> I’ve probably got my job configured sub optimally. I’ve refactored my code
> so that I can submit subsets of the flow at a time and it seems to work.
> The break between the JobManager able to acknowledge job and not seems to
> hover somewhere between 10-20 flows.
>
>
>
> I guess what doesn’t make too much sense to me is if the user code is
> uploaded once to the JobManager and downloaded from each TaskManager, what
> exactly is the JobManager doing that’s keeping it busy? It’s the same code
> across the TaskManagers.
>
>
>
> I’ll get you the logs shortly.
>
>
>
> *From:* Till Rohrmann [mailto:trohrmann@apache.org]
> *Sent:* Wednesday, November 08, 2017 10:17 AM
> *To:* Chan, Regina [Tech]
> *Cc:* Chesnay Schepler; user@flink.apache.org
>
> *Subject:* Re: Job Manager Configuration
>
>
>
> Quick question Regina: Which version of Flink are you running?
>
>
>
> Cheers,
> Till
>
>
>
> On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <ti...@gmail.com>
> wrote:
>
> Hi Regina,
>
>
>
> the user code is uploaded once to the `JobManager` and then downloaded
> once by each `TaskManager` when it first receives the command to execute
> the first task of your job.
>
>
>
> As Chesnay said there is no fundamental limitation to the size of the
> Flink job. However, it might be the case that you have configured your job
> sub-optimally. You said that you have 300 parallel flows. Depending on
> whether you've defined separate slot sharing groups for them or not, it
> might be the case that parallel subtasks of all 300 parallel jobs share the
> same slot (if you haven't changed the slot sharing group). Depending on
> what you calculate, this can be inefficient because the individual tasks
> don't get much computation time. Moreover, all tasks will allocate some
> objects on the heap which can lead to more GC. Therefore, it might make
> sense to group some of the jobs together and run these jobs in batches
> after the previous batch completed. But this is hard to say without knowing
> the details of your job and getting a glimpse at the JobManager logs.
>
>
>
> Concerning the exception you're seeing, it would also be helpful to see
> the logs of the client and the JobManager. Actually, the scheduling of the
> job is independent of the response. Only the creation of the ExecutionGraph
> and making the JobGraph highly available in case of an HA setup are
> executed before the JobManager acknowledges the job submission. Only if
> this acknowledgement message is not received in time on the client side is
> the SubmissionTimeoutException thrown. Therefore, I assume that somehow
> the JobManager is too busy or kept from sending the acknowledgement message.
>
>
>
> Cheers,
>
> Till
>
>
>
>
>
>
>
> On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <Re...@gs.com> wrote:
>
> Does it copy per TaskManager or per operator? I only gave it 10
> TaskManagers with 2 slots. I’m perfectly fine with it queuing up and
> running when it has the resources to.
>
>
>
>
>
>
>
> *From:* Chesnay Schepler [mailto:chesnay@apache.org]
> *Sent:* Wednesday, November 01, 2017 7:09 AM
> *To:* user@flink.apache.org
> *Subject:* Re: Job Manager Configuration
>
>
>
> AFAIK there is no theoretical limit on the size of the plan, it just
> depends on the available resources.
>
>
>
> The job submission times out because it takes too long to deploy all the
> operators that the job defines. With 300 flows, each with 6 operators,
> you're looking at potentially (1800 * parallelism) tasks that have to be
> deployed. For each task, Flink copies the user code of *all* flows to the
> executing TaskManager, which the network may just not be able to handle
> in time.
>
> I suggest splitting your job into smaller batches, or even running each
> flow independently.
>
> On 31.10.2017 16:25, Chan, Regina wrote:
>
> Asking an additional question, what is the largest plan that the
> JobManager can handle? Is there a limit? My flows don’t need to run in
> parallel and can run independently. I wanted them to run in one single job
> because it’s part of one logical commit on my side.
>
>
>
> Thanks,
>
> Regina
>
>
>
> *From:* Chan, Regina [Tech]
> *Sent:* Monday, October 30, 2017 3:22 PM
> *To:* 'user@flink.apache.org'
> *Subject:* Job Manager Configuration
>
>
>
> Flink Users,
>
>
>
> I have about 300 parallel flows in one job, each with 2 inputs, 3
> operators, and 1 sink, which makes for a large job. I keep getting the
> timeout exception below even though I’ve already set a 30-minute timeout
> and a 6 GB heap on the JobManager. Is there a heuristic for configuring
> the JobManager better?
>
>
>
> Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException:
> Job submission to the JobManager timed out. You may increase
> 'akka.client.timeout' in case the JobManager needs more time to configure
> and confirm the job submission.
>
>
>
> *Regina Chan*
>
> *Goldman Sachs** –* Enterprise Platforms, Data Architecture
>
> *30 Hudson Street, 37th floor | Jersey City, NY 07302* | (212) 902-5697
>
>
>
>
>
>
>
>
>

RE: Job Manager Configuration

Posted by "Chan, Regina" <Re...@gs.com>.
Thanks for the responses!

I’m currently using 1.2.0 – going to bump it up once I have things stabilized. I haven’t defined any slot sharing groups, but I do think that I’ve probably got my job configured sub-optimally. I’ve refactored my code so that I can submit subsets of the flow at a time, and it seems to work. The point where the JobManager stops being able to acknowledge the job seems to hover somewhere between 10 and 20 flows.

I guess what doesn’t make too much sense to me is: if the user code is uploaded once to the JobManager and downloaded once by each TaskManager, what exactly is the JobManager doing that’s keeping it busy? It’s the same code across the TaskManagers.

I’ll get you the logs shortly.

From: Till Rohrmann [mailto:trohrmann@apache.org]
Sent: Wednesday, November 08, 2017 10:17 AM
To: Chan, Regina [Tech]
Cc: Chesnay Schepler; user@flink.apache.org
Subject: Re: Job Manager Configuration

Quick question Regina: Which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <ti...@gmail.com>> wrote:
Hi Regina,

the user code is uploaded once to the `JobManager` and then downloaded once by each `TaskManager` when it first receives the command to execute the first task of your job.

As Chesnay said there is no fundamental limitation to the size of the Flink job. However, it might be the case that you have configured your job sub-optimally. You said that you have 300 parallel flows. Depending on whether you've defined separate slot sharing groups for them or not, it might be the case that parallel subtasks of all 300 parallel flows share the same slot (if you haven't changed the slot sharing group). Depending on what you calculate, this can be inefficient because the individual tasks don't get much computation time. Moreover, all tasks will allocate some objects on the heap, which can lead to more GC. Therefore, it might make sense to group some of the jobs together and run these jobs in batches after the previous batch has completed. But this is hard to say without knowing the details of your job and getting a glimpse at the JobManager logs.

Concerning the exception you're seeing, it would also be helpful to see the logs of the client and the JobManager. Actually, the scheduling of the job is independent of the response. Only the creation of the ExecutionGraph and making the JobGraph highly available in case of an HA setup are executed before the JobManager acknowledges the job submission. Only if this acknowledgement message is not received in time on the client side is the SubmissionTimeoutException thrown. Therefore, I assume that somehow the JobManager is too busy or kept from sending the acknowledgement message.

Cheers,
Till



On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <Re...@gs.com>> wrote:
Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots. I’m perfectly fine with it queuing up and running when it has the resources to.



From: Chesnay Schepler [mailto:chesnay@apache.org<ma...@apache.org>]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org<ma...@flink.apache.org>
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan, it just depends on the available resources.


The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you're looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task, Flink copies the user code of all flows to the executing TaskManager, which the network may just not be able to handle in time.

I suggest splitting your job into smaller batches, or even running each flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:
Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org<ma...@flink.apache.org>'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I’ve already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 • (212) 902-5697






Re: Job Manager Configuration

Posted by Till Rohrmann <tr...@apache.org>.
Quick question Regina: Which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <ti...@gmail.com>
wrote:

> Hi Regina,
>
> the user code is uploaded once to the `JobManager` and then downloaded
> once by each `TaskManager` when it first receives the command to execute
> the first task of your job.
>
> As Chesnay said there is no fundamental limitation to the size of the
> Flink job. However, it might be the case that you have configured your job
> sub-optimally. You said that you have 300 parallel flows. Depending on
> whether you've defined separate slot sharing groups for them or not, it
> might be the case that parallel subtasks of all 300 parallel jobs share the
> same slot (if you haven't changed the slot sharing group). Depending on
> what you calculate, this can be inefficient because the individual tasks
> don't get much computation time. Moreover, all tasks will allocate some
> objects on the heap which can lead to more GC. Therefore, it might make
> sense to group some of the jobs together and run these jobs in batches
> after the previous batch completed. But this is hard to say without knowing
> the details of your job and getting a glimpse at the JobManager logs.
>
> Concerning the exception you're seeing, it would also be helpful to see
> the logs of the client and the JobManager. Actually, the scheduling of the
> job is independent of the response. Only the creation of the ExecutionGraph
> and making the JobGraph highly available in case of an HA setup are
> executed before the JobManager acknowledges the job submission. Only if
> this acknowledgement message is not received in time on the client side is
> the SubmissionTimeoutException thrown. Therefore, I assume that somehow
> the JobManager is too busy or kept from sending the acknowledgement message.
>
> Cheers,
> Till
>
>
>
> On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <Re...@gs.com> wrote:
>
>> Does it copy per TaskManager or per operator? I only gave it 10
>> TaskManagers with 2 slots. I’m perfectly fine with it queuing up and
>> running when it has the resources to.
>>
>>
>>
>>
>>
>>
>>
>> *From:* Chesnay Schepler [mailto:chesnay@apache.org]
>> *Sent:* Wednesday, November 01, 2017 7:09 AM
>> *To:* user@flink.apache.org
>> *Subject:* Re: Job Manager Configuration
>>
>>
>>
>> AFAIK there is no theoretical limit on the size of the plan, it just
>> depends on the available resources.
>>
>>
>> The job submission times out because it takes too long to deploy all the
>> operators that the job defines. With 300 flows, each with 6 operators,
>> you're looking at potentially (1800 * parallelism) tasks that have to be
>> deployed. For each task, Flink copies the user code of *all* flows to the
>> executing TaskManager, which the network may just not be able to handle
>> in time.
>>
>> I suggest splitting your job into smaller batches, or even running each
>> flow independently.
>>
>> On 31.10.2017 16:25, Chan, Regina wrote:
>>
>> Asking an additional question, what is the largest plan that the
>> JobManager can handle? Is there a limit? My flows don’t need to run in
>> parallel and can run independently. I wanted them to run in one single job
>> because it’s part of one logical commit on my side.
>>
>>
>>
>> Thanks,
>>
>> Regina
>>
>>
>>
>> *From:* Chan, Regina [Tech]
>> *Sent:* Monday, October 30, 2017 3:22 PM
>> *To:* 'user@flink.apache.org'
>> *Subject:* Job Manager Configuration
>>
>>
>>
>> Flink Users,
>>
>>
>>
>> I have about 300 parallel flows in one job, each with 2 inputs, 3
>> operators, and 1 sink, which makes for a large job. I keep getting the
>> timeout exception below even though I’ve already set a 30-minute timeout
>> and a 6 GB heap on the JobManager. Is there a heuristic for configuring
>> the JobManager better?
>>
>>
>>
>> Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException:
>> Job submission to the JobManager timed out. You may increase
>> 'akka.client.timeout' in case the JobManager needs more time to configure
>> and confirm the job submission.
>>
>>
>>
>> *Regina Chan*
>>
>> *Goldman Sachs* *–* Enterprise Platforms, Data Architecture
>>
>> *30 Hudson Street, 37th floor | Jersey City, NY 07302
>> <https://maps.google.com/?q=30+Hudson+Street,+37th+floor+%7C+Jersey+City,+NY+07302&entry=gmail&source=g>*
>> (212) 902-5697
>>
>>
>>
>>
>>
>
>

Re: Job Manager Configuration

Posted by Till Rohrmann <ti...@gmail.com>.
Hi Regina,

the user code is uploaded once to the `JobManager` and then downloaded
once by each `TaskManager` when it first receives the command to execute
the first task of your job.

As Chesnay said there is no fundamental limitation to the size of the Flink
job. However, it might be the case that you have configured your job
sub-optimally. You said that you have 300 parallel flows. Depending on
whether you've defined separate slot sharing groups for them or not, it
might be the case that parallel subtasks of all 300 parallel flows share the
same slot (if you haven't changed the slot sharing group). Depending on
what you calculate, this can be inefficient because the individual tasks
don't get much computation time. Moreover, all tasks will allocate some
objects on the heap, which can lead to more GC. Therefore, it might make
sense to group some of the jobs together and run these jobs in batches
after the previous batch has completed. But this is hard to say without
knowing the details of your job and getting a glimpse at the JobManager logs.
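
A minimal DataStream API sketch, purely illustrative, of what defining a separate slot sharing group per flow could look like:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            for (int i = 0; i < 300; i++) {
                String group = "flow-" + i;  // without this, everything shares the "default" group

                DataStream<String> source = env
                        .fromElements("record-" + i)   // stands in for the flow's real inputs
                        .slotSharingGroup(group);

                source.map(new Identity())             // stands in for the flow's operators
                      .slotSharingGroup(group)
                      .print();                        // stands in for the flow's sink
            }

            env.execute("slot sharing sketch");
        }

        private static final class Identity implements MapFunction<String, String> {
            @Override
            public String map(String value) {
                return value;
            }
        }
    }

Keep in mind that each distinct slot sharing group needs its own slots, so with only 10 TaskManagers x 2 slots this kind of isolation would quickly exhaust the cluster; the default shared group, or the batching approach suggested earlier, is usually the more practical option.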

Concerning the exception you're seeing, it would also be helpful to see the
logs of the client and the JobManager. Actually, the scheduling of the job
is independent of the response. Only the creation of the ExecutionGraph and
making the JobGraph highly available in case of an HA setup are executed
before the JobManager acknowledges the job submission. Only if this
acknowledgement message is not received in time on the client side is the
SubmissionTimeoutException thrown. Therefore, I assume that somehow the
JobManager is too busy or kept from sending the acknowledgement message.

Cheers,
Till



On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <Re...@gs.com> wrote:

> Does it copy per TaskManager or per operator? I only gave it 10
> TaskManagers with 2 slots. I’m perfectly fine with it queuing up and
> running when it has the resources to.
>
>
>
>
>
>
>
> *From:* Chesnay Schepler [mailto:chesnay@apache.org]
> *Sent:* Wednesday, November 01, 2017 7:09 AM
> *To:* user@flink.apache.org
> *Subject:* Re: Job Manager Configuration
>
>
>
> AFAIK there is no theoretical limit on the size of the plan, it just
> depends on the available resources.
>
>
> The job submission times out because it takes too long to deploy all the
> operators that the job defines. With 300 flows, each with 6 operators,
> you're looking at potentially (1800 * parallelism) tasks that have to be
> deployed. For each task, Flink copies the user code of *all* flows to the
> executing TaskManager, which the network may just not be able to handle
> in time.
>
> I suggest splitting your job into smaller batches, or even running each
> flow independently.
>
> On 31.10.2017 16:25, Chan, Regina wrote:
>
> Asking an additional question, what is the largest plan that the
> JobManager can handle? Is there a limit? My flows don’t need to run in
> parallel and can run independently. I wanted them to run in one single job
> because it’s part of one logical commit on my side.
>
>
>
> Thanks,
>
> Regina
>
>
>
> *From:* Chan, Regina [Tech]
> *Sent:* Monday, October 30, 2017 3:22 PM
> *To:* 'user@flink.apache.org'
> *Subject:* Job Manager Configuration
>
>
>
> Flink Users,
>
>
>
> I have about 300 parallel flows in one job, each with 2 inputs, 3
> operators, and 1 sink, which makes for a large job. I keep getting the
> timeout exception below even though I’ve already set a 30-minute timeout
> and a 6 GB heap on the JobManager. Is there a heuristic for configuring
> the JobManager better?
>
>
>
> Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException:
> Job submission to the JobManager timed out. You may increase
> 'akka.client.timeout' in case the JobManager needs more time to configure
> and confirm the job submission.
>
>
>
> *Regina Chan*
>
> *Goldman Sachs* *–* Enterprise Platforms, Data Architecture
>
> *30 Hudson Street, 37th floor | Jersey City, NY 07302
> <https://maps.google.com/?q=30+Hudson+Street,+37th+floor+%7C+Jersey+City,+NY+07302&entry=gmail&source=g>*
> (212) 902-5697
>
>
>
>
>

RE: Job Manager Configuration

Posted by "Chan, Regina" <Re...@gs.com>.
Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots. I'm perfectly fine with it queuing up and running when it has the resources to.



From: Chesnay Schepler [mailto:chesnay@apache.org]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan, it just depends on the available resources.

The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you're looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task, Flink copies the user code of all flows to the executing TaskManager, which the network may just not be able to handle in time.

I suggest splitting your job into smaller batches, or even running each flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:
Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org<ma...@flink.apache.org>'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I've already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 *  (212) 902-5697




Re: Job Manager Configuration

Posted by Chesnay Schepler <ch...@apache.org>.
AFAIK there is no theoretical limit on the size of the plan, it just 
depends on the available resources.

The job submission times out because it takes too long to deploy all the
operators that the job defines. With 300 flows, each with 6 operators,
you're looking at potentially (1800 * parallelism) tasks that have to be
deployed. For each task, Flink copies the user code of /all/ flows to the
executing TaskManager, which the network may just not be able to handle in
time.

I suggest splitting your job into smaller batches, or even running each
flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:
>
> Asking an additional question, what is the largest plan that the 
> JobManager can handle? Is there a limit? My flows don’t need to run in 
> parallel and can run independently. I wanted them to run in one single 
> job because it’s part of one logical commit on my side.
>
> Thanks,
>
> Regina
>
> *From:*Chan, Regina [Tech]
> *Sent:* Monday, October 30, 2017 3:22 PM
> *To:* 'user@flink.apache.org'
> *Subject:* Job Manager Configuration
>
> Flink Users,
>
> I have about 300 parallel flows in one job, each with 2 inputs, 3
> operators, and 1 sink, which makes for a large job. I keep getting the
> timeout exception below even though I’ve already set a 30-minute timeout
> and a 6 GB heap on the JobManager. Is there a heuristic for configuring
> the JobManager better?
>
> Caused by: 
> org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: 
> Job submission to the JobManager timed out. You may increase 
> 'akka.client.timeout' in case the JobManager needs more time to 
> configure and confirm the job submission.
>
> *Regina Chan*
>
> *Goldman Sachs* *–* Enterprise Platforms, Data Architecture
>
> *30 Hudson Street, 37th floor | Jersey City, NY 07302* | (212) 902-5697
>


Re: Job Manager Configuration

Posted by Joshua Griffith <JG...@CampusLabs.com>.
We run on a dedicated cluster managed by Kubernetes. The task managers run as a DaemonSet and the job manager runs as a Deployment. We had to increase the Akka frame size and client timeout on the service that submits jobs but we haven’t altered any Akka settings in the cluster. Here’s the container we run: https://github.com/orgsync/docker-flink
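
For reference, a sketch of what the two client-side settings mentioned above typically look like in the flink-conf.yaml used by the submitting service (akka.framesize and akka.client.timeout are standard Flink options; the values here are only illustrative):

    akka.framesize: 104857600b     # larger maximum Akka message size for big job graphs (default is on the order of 10 MB)
    akka.client.timeout: 600 s     # longer wait for the job submission acknowledgement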

On Nov 18, 2017, at 4:10 PM, Chan, Regina <Re...@gs.com>> wrote:

Is your job running on a standalone cluster? I’m using a detached yarn session in a multi-tenant environment.
And I’m guessing you haven’t had to do anything special for the akka configurations.


From: Joshua Griffith [mailto:JGriffith@CampusLabs.com]
Sent: Thursday, November 16, 2017 2:57 PM
To: Chan, Regina [Tech]
Cc: user@flink.apache.org<ma...@flink.apache.org>
Subject: Re: Job Manager Configuration

I have an IO-dominated batch job with 471 distinct tasks (3786 tasks with parallelism) running on 8 nodes with 12 GiB of memory and 4 CPUs each. I haven’t had any problems adding additional tasks except for 1) tasks timing out the first time the cluster is started (I suppose the JVM needs to warm up), and 2) the UI can’t really handle this many tasks, although using Firefox Quantum makes it possible to see what’s going on.

Joshua

On Oct 31, 2017, at 10:25 AM, Chan, Regina <Re...@gs.com>> wrote:

Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org<ma...@flink.apache.org>'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I’ve already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 •  (212) 902-5697


RE: Job Manager Configuration

Posted by "Chan, Regina" <Re...@gs.com>.
Is your job running on a standalone cluster? I’m using a detached yarn session in a multi-tenant environment.
And I’m guessing you haven’t had to do anything special for the akka configurations.
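
For comparison, a detached YARN session along those lines would be started with something like the following (Flink 1.2-era flags; the TaskManager memory value is illustrative, while the 10 containers, 2 slots, and 6 GB JobManager heap mirror the numbers mentioned earlier in the thread). As far as I know, the akka.* options are picked up from the conf/flink-conf.yaml of the client that starts the session:

    # -d detached, -n YARN containers, -s slots per TaskManager, -jm/-tm memory in MB
    ./bin/yarn-session.sh -d -n 10 -s 2 -jm 6144 -tm 4096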


From: Joshua Griffith [mailto:JGriffith@CampusLabs.com]
Sent: Thursday, November 16, 2017 2:57 PM
To: Chan, Regina [Tech]
Cc: user@flink.apache.org
Subject: Re: Job Manager Configuration

I have an IO-dominated batch job with 471 distinct tasks (3786 tasks with parallelism) running on 8 nodes with 12 GiB of memory and 4 CPUs each. I haven’t had any problems adding additional tasks except for 1) tasks timing out the first time the cluster is started (I suppose the JVM needs to warm up), and 2) the UI can’t really handle this many tasks, although using Firefox Quantum makes it possible to see what’s going on.

Joshua

On Oct 31, 2017, at 10:25 AM, Chan, Regina <Re...@gs.com>> wrote:

Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org<ma...@flink.apache.org>'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I’ve already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 •  (212) 902-5697


Re: Job Manager Configuration

Posted by Joshua Griffith <JG...@CampusLabs.com>.
I have an IO-dominated batch job with 471 distinct tasks (3786 tasks with parallelism) running on 8 nodes with 12 GiB of memory and 4 CPUs each. I haven’t had any problems adding additional tasks except for 1) tasks timing out the first time the cluster is started (I suppose the JVM needs to warm up), and 2) the UI can’t really handle this many tasks, although using Firefox Quantum makes it possible to see what’s going on.

Joshua

On Oct 31, 2017, at 10:25 AM, Chan, Regina <Re...@gs.com>> wrote:

Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org<ma...@flink.apache.org>'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I’ve already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 •  (212) 902-5697


RE: Job Manager Configuration

Posted by "Chan, Regina" <Re...@gs.com>.
Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below even though I've already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic for configuring the JobManager better?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 *  (212) 902-5697