Posted to issues@flink.apache.org by "Robert Metzger (Jira)" <ji...@apache.org> on 2020/02/12 12:27:00 UTC

[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

    [ https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035325#comment-17035325 ] 

Robert Metzger commented on FLINK-16018:
----------------------------------------

Setting the config option {{web.timeout: 300000}} reveals the real underlying issue:
{code}
2020-02-07T15:59:57.2209501Z 2020-02-07 15:59:50,547 ERROR org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Failed to submit job 115b0668417af408b4a129499c634396.
2020-02-07T15:59:57.2209618Z java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
2020-02-07T15:59:57.2209725Z 	at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
2020-02-07T15:59:57.2209812Z 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
2020-02-07T15:59:57.2209908Z 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
2020-02-07T15:59:57.2209993Z 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
2020-02-07T15:59:57.2210098Z 	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2020-02-07T15:59:57.2210177Z 	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2020-02-07T15:59:57.2210274Z 	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2020-02-07T15:59:57.2210433Z 	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-02-07T15:59:57.2210542Z Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
2020-02-07T15:59:57.2210631Z 	at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152)
2020-02-07T15:59:57.2210744Z 	at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
2020-02-07T15:59:57.2210844Z 	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379)
2020-02-07T15:59:57.2210945Z 	at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
2020-02-07T15:59:57.2211019Z 	... 7 more
2020-02-07T15:59:57.2211550Z Caused by: org.apache.flink.runtime.client.JobExecutionException: Cannot initialize task 'DataSink (CsvOutputFormat (path: s3://test-data/temp/test_batch_wordcount-0205c494-01da-4cde-ae74-1925833efb57, delimiter:  ))': doesBucketExist on test-data: com.amazonaws.SdkClientException: Unable to execute HTTP request: minio: Unable to execute HTTP request: minio
2020-02-07T15:59:57.2211736Z 	at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:216)
2020-02-07T15:59:57.2211831Z 	at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:253)
2020-02-07T15:59:57.2211942Z 	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:225)
2020-02-07T15:59:57.2212028Z 	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:213)
2020-02-07T15:59:57.2212127Z 	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:117)
2020-02-07T15:59:57.2212218Z 	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
2020-02-07T15:59:57.2212330Z 	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
2020-02-07T15:59:57.2212497Z 	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
2020-02-07T15:59:57.2212592Z 	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
2020-02-07T15:59:57.2212714Z 	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
2020-02-07T15:59:57.2212815Z 	at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
2020-02-07T15:59:57.2212902Z 	... 10 more
2020-02-07T15:59:57.2213257Z Caused by: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on test-data: com.amazonaws.SdkClientException: Unable to execute HTTP request: minio: Unable to execute HTTP request: minio
2020-02-07T15:59:57.2213378Z 	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:177)
2020-02-07T15:59:57.2213464Z 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
2020-02-07T15:59:57.2213565Z 	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:260)
2020-02-07T15:59:57.2213641Z 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
2020-02-07T15:59:57.2213731Z 	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
2020-02-07T15:59:57.2213911Z 	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
2020-02-07T15:59:57.2214077Z 	at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:372)
2020-02-07T15:59:57.2214226Z 	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:308)
2020-02-07T15:59:57.2214370Z 	at org.apache.flink.fs.s3.common.AbstractS3FileSystemFactory.create(AbstractS3FileSystemFactory.java:126)
2020-02-07T15:59:57.2214473Z 	at org.apache.flink.core.fs.PluginFileSystemFactory.create(PluginFileSystemFactory.java:61)
2020-02-07T15:59:57.2214559Z 	at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:441)
2020-02-07T15:59:57.2214750Z 	at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:362)
2020-02-07T15:59:57.2214831Z 	at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
2020-02-07T15:59:57.2214928Z 	at org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:275)
2020-02-07T15:59:57.2215019Z 	at org.apache.flink.runtime.jobgraph.InputOutputFormatVertex.initializeOnMaster(InputOutputFormatVertex.java:100)
2020-02-07T15:59:57.2215129Z 	at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:212)
2020-02-07T15:59:57.2215208Z 	... 20 more
2020-02-07T15:59:57.2215291Z Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: minio
2020-02-07T15:59:57.2215377Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1114)
2020-02-07T15:59:57.2215488Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1064)
2020-02-07T15:59:57.2215582Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
2020-02-07T15:59:57.2215693Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
2020-02-07T15:59:57.2215780Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
2020-02-07T15:59:57.2215885Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
2020-02-07T15:59:57.2215972Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
2020-02-07T15:59:57.2216070Z 	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
2020-02-07T15:59:57.2216154Z 	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
2020-02-07T15:59:57.2216251Z 	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
2020-02-07T15:59:57.2216338Z 	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1337)
2020-02-07T15:59:57.2216441Z 	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1277)
2020-02-07T15:59:57.2216613Z 	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:373)
2020-02-07T15:59:57.2216709Z 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
2020-02-07T15:59:57.2216774Z 	... 34 more
2020-02-07T15:59:57.2216848Z Caused by: java.net.UnknownHostException: minio
2020-02-07T15:59:57.2216918Z 	at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
2020-02-07T15:59:57.2217006Z 	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
2020-02-07T15:59:57.2217080Z 	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
2020-02-07T15:59:57.2217168Z 	at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
2020-02-07T15:59:57.2217250Z 	at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
2020-02-07T15:59:57.2217362Z 	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
2020-02-07T15:59:57.2217468Z 	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
2020-02-07T15:59:57.2217564Z 	at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
2020-02-07T15:59:57.2217656Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-02-07T15:59:57.2217734Z 	at java.lang.reflect.Method.invoke(Method.java:498)
2020-02-07T15:59:57.2217832Z 	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
2020-02-07T15:59:57.2217919Z 	at com.amazonaws.http.conn.$Proxy33.connect(Unknown Source)
2020-02-07T15:59:57.2218014Z 	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
2020-02-07T15:59:57.2218097Z 	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
2020-02-07T15:59:57.2218322Z 	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
2020-02-07T15:59:57.2218411Z 	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
2020-02-07T15:59:57.2218507Z 	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
2020-02-07T15:59:57.2218593Z 	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
2020-02-07T15:59:57.2218689Z 	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
2020-02-07T15:59:57.2218775Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1236)
2020-02-07T15:59:57.2218880Z 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
2020-02-07T15:59:57.2218953Z 	... 46 more
{code}

As a user, I would hope to see this exception, or at least a stack trace pointing into this code path, to get an indication of what the problem might be.
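
To illustrate the kind of message that would help here, the following is a minimal sketch of walking a cause chain down to its innermost exception and reporting it next to the outermost message. The class {{RootCauseDemo}} and its methods are hypothetical names for illustration only and are not part of Flink; the exception chain in {{main}} is a shortened stand-in for the log above, rebuilt with plain JDK exceptions.

```java
import java.net.UnknownHostException;

/** Hypothetical helper for illustration; not part of Flink. */
final class RootCauseDemo {

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause, or when a throwable reports itself as its cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Builds a one-line summary: outermost message plus the root cause. */
    static String summarize(Throwable t) {
        Throwable root = rootCause(t);
        return t.getMessage()
                + " (root cause: " + root.getClass().getName() + ": " + root.getMessage() + ")";
    }

    public static void main(String[] args) {
        // Shortened stand-in for the chain in the log, using only JDK exception types.
        Throwable chain = new RuntimeException("Could not set up JobManager",
                new IllegalStateException("Unable to execute HTTP request: minio",
                        new UnknownHostException("minio")));
        // Prints: Could not set up JobManager (root cause: java.net.UnknownHostException: minio)
        System.out.println(summarize(chain));
    }
}
```

With a summary like this, the {{UnknownHostException: minio}} at the bottom of the trace would be visible without reading all five nested causes.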

> Improve error reporting when submitting batch job (instead of AskTimeoutException)
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-16018
>                 URL: https://issues.apache.org/jira/browse/FLINK-16018
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: Robert Metzger
>            Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit test, I noticed that the JobSubmission is not producing very helpful error messages.
> Environment:
> - A simple batch wordcount job
> - An unavailable minio S3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with an AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-939201095]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
> 2020-02-07T11:38:27.1189538Z 	at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z 	at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z 	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z 	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z 	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z 	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z 	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z 	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z 	at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.
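
The two observations in this thread (a timeout after 10 seconds, and the real exception once the timeout is raised) can be reproduced in miniature with plain JDK futures. This is only an analogy under stated assumptions: {{TimeoutMaskingDemo}}, {{slowFailingSetup}} and {{observe}} are hypothetical names, the delays are arbitrary, and the code does not use or mirror Flink's RPC layer. It shows that a caller who stops waiting before a slow operation fails only learns about the timeout, not the failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical demo for illustration; not Flink code. */
final class TimeoutMaskingDemo {

    /** Simulates a setup step that fails, but only after delayMillis have passed. */
    static CompletableFuture<String> slowFailingSetup(long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Stand-in for the real failure (an unresolvable host in the log).
            throw new IllegalStateException("Unknown host: minio");
        });
    }

    /** Waits up to timeoutMillis and returns what the caller gets to see. */
    static String observe(long delayMillis, long timeoutMillis) {
        try {
            return slowFailingSetup(delayMillis).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gave up first, so the underlying failure never reaches it.
            return "timeout";
        } catch (ExecutionException e) {
            // The caller waited long enough to receive the actual failure.
            return "failure: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(observe(300, 50));   // prints "timeout"
        System.out.println(observe(300, 5000)); // prints "failure: Unknown host: minio"
    }
}
```

In this sketch the only difference between the two calls is how long the caller waits, which matches the behaviour described above: the short wait yields a bare timeout, the long wait yields the underlying cause.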


