Posted to user@flink.apache.org by "Chan, Regina" <Re...@gs.com> on 2018/04/30 19:53:57 UTC

RE: Fat jar fails deployment (streaming job too large)

Any updates on this one? I'm seeing similar issues with 1.3.3 and the batch API.

The main difference is that I have even more operators (~850), mostly maps and filters with one cogroup. I don't really want to set akka.client.timeout to anything more than 10 minutes, seeing that it still fails with that amount. The akka.framesize is already 500 MB...

akka.framesize: 524288000b
akka.ask.timeout: 10min
akka.client.timeout: 10min
akka.lookup.timeout: 10min
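
For anyone comparing the frame sizes quoted in this thread, here is a small sketch that converts the byte-suffixed values into MiB. The helper name bytes_to_mib is made up for illustration and is not part of Flink; it only handles the plain "<number>b" form used in the configs here.

```python
def bytes_to_mib(value: str) -> float:
    """Convert a size such as '524288000b' (bytes with a 'b' suffix) to MiB.

    Hypothetical helper for illustration only; it accepts just the
    '<integer>b' form that appears in the configs in this thread.
    """
    if not value.endswith("b"):
        raise ValueError(f"expected a value ending in 'b', got {value!r}")
    return int(value[:-1]) / (1024 * 1024)


if __name__ == "__main__":
    # The two akka.framesize values quoted in this thread.
    for setting in ("524288000b", "104857600b"):
        print(setting, "=", bytes_to_mib(setting), "MiB")
```

So the 524288000b above is 500 MiB, and the 104857600b tried earlier in the thread is 100 MiB.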


Thanks,
Regina



-----Original Message-----
From: Niels [mailto:nielsdenissen@gmail.com] 
Sent: Tuesday, February 27, 2018 11:40 AM
To: user@flink.apache.org
Subject: Re: Fat jar fails deployment (streaming job too large)

Hi Till,

I've just tried setting the following on the *client*:
akka.client.timeout: 300s 

On the *cluster*:
akka.ask.timeout: 30s
akka.lookup.timeout: 30s
akka.client.timeout: 300s
akka.framesize: 104857600b #(10x the original of 10MB)
akka.log.lifecycle.events: true

This still gives me the same issue: the fat jar isn't deployed. See the attached
files for the logs of the jobmanager and the deployer. Let me know if I can
provide you with any additional info. Thanks for your help!

Cheers,
Niels

Flink_deploy_log.txt
<https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_file_t1147_Flink-5Fdeploy-5Flog.txt&d=DwICAg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=vus_2CMQfE0wKmJ4Q_gOWWsBmKlgzMeEwtqShIeKvak&m=p4nMsVlOWZXkIxtRMVt11ovf0gctuHFZJfzvDgpvyKk&s=HxWMISxclHHjDET_E_zY-P95lt5mvMxU7YfGx9vyFcg&e= >  
flink_jobmanager_log.txt
<https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_file_t1147_flink-5Fjobmanager-5Flog.txt&d=DwICAg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=vus_2CMQfE0wKmJ4Q_gOWWsBmKlgzMeEwtqShIeKvak&m=p4nMsVlOWZXkIxtRMVt11ovf0gctuHFZJfzvDgpvyKk&s=8PvIcLRPFokJ5XOPsczSatUddfM-xd6eG_FxaDlHEBk&e= >  






Re: Fat jar fails deployment (streaming job too large)

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Short answer: it could be that your job is simply too big to be serialised, distributed and deserialised in the given time, and you would have to increase the timeouts even more.

Long answer: 

Do you have the same problem when you try to submit a smaller job? Does your cluster work for simpler jobs? Try cutting down or simplifying your job until it works. Maybe you will be able to pin down a single operator that’s causing the problem (one that has, for example, a huge static data structure). If so, you might be able to optimise your operators in some way. Maybe some operator is doing something weird and causing problems.
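
The "cut the job down until it works" idea can be done as a bisection rather than one operator at a time. Below is a rough sketch, not a Flink API: submits_ok is a hypothetical callback that you would implement yourself (build the job with only the first n operators, submit it, and return whether the submission succeeded). It also assumes that if n operators submit fine, any smaller prefix does too, which may not hold for every job.

```python
from typing import Callable


def largest_working_size(total_operators: int,
                         submits_ok: Callable[[int], bool]) -> int:
    """Return the largest n in [0, total_operators] for which submits_ok(n) is True.

    Assumes monotonic behaviour (if n operators work, fewer also work) and
    that an empty job (n = 0) trivially works. submits_ok is a hypothetical,
    user-supplied check; nothing here talks to a real cluster.
    """
    lo, hi = 0, total_operators
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the range always shrinks
        if submits_ok(mid):
            lo = mid              # mid works, the answer is mid or larger
        else:
            hi = mid - 1          # mid fails, the answer is smaller
    return lo


if __name__ == "__main__":
    # Stand-in for real submissions: pretend anything above 612 operators fails.
    limit = largest_working_size(850, lambda n: n <= 612)
    print("largest working prefix:", limit)
    print("first suspicious operator index:", limit + 1)
```

For a job of about 850 operators this needs at most ten trial submissions, and the operator just past the returned size is the first one worth inspecting.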

You could also approach this problem from the other direction (as previously suggested by Fabian). Try to profile the cluster and find out what it is doing and where the problem is. The Job Manager? One Task Manager? All of the Task Managers? Is there high CPU usage somewhere? Maybe one thread somewhere is overloaded? High network usage? After identifying potentially problematic JVMs, you could attach a code profiler or print stack traces to further pin down the problem.
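
As a rough illustration of the "print stack traces" step, the sketch below takes a thread dump with the JDK's jstack tool and counts the top stack frame of every RUNNABLE thread. Repeating this a few times while the submission hangs acts as a poor man's profiler. The helper names are made up for this example, the parsing assumes the usual HotSpot thread-dump layout, and finding the pid of the JobManager or TaskManager JVM is left to you.

```python
import subprocess
from collections import Counter


def take_thread_dump(pid: int) -> str:
    """Run `jstack <pid>` (requires a JDK on the PATH) and return its text output."""
    result = subprocess.run(["jstack", str(pid)],
                            capture_output=True, text=True, check=True)
    return result.stdout


def top_runnable_frames(dump: str) -> Counter:
    """Count the topmost stack frame of each RUNNABLE thread in one thread dump.

    Assumes each thread starts with a quoted-name header line, followed by a
    'java.lang.Thread.State: ...' line and then 'at ...' frame lines.
    """
    frames = Counter()
    runnable = False
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            runnable = False  # a new thread section begins
        elif stripped.startswith("java.lang.Thread.State:"):
            runnable = "RUNNABLE" in stripped
        elif runnable and stripped.startswith("at "):
            frames[stripped[3:]] += 1
            runnable = False  # only the first frame of this thread is counted
    return frames


def hottest_frames(dumps, n=5):
    """Merge several dumps and return the n most frequent top frames."""
    total = Counter()
    for dump in dumps:
        total.update(top_runnable_frames(dump))
    return total.most_common(n)
```

A frame that keeps showing up at the top across dumps (for example something in serialisation code) is a hint about where the time is going.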

Piotrek
