Posted to issues@flink.apache.org by "Robert Metzger (Jira)" <ji...@apache.org> on 2020/02/12 12:27:00 UTC
[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)
[ https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035325#comment-17035325 ]
Robert Metzger commented on FLINK-16018:
----------------------------------------
Setting the config to {{web.timeout: 300000}} reveals the real underlying issue:
{code}
2020-02-07T15:59:57.2209501Z 2020-02-07 15:59:50,547 ERROR org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Failed to submit job 115b0668417af408b4a129499c634396.
2020-02-07T15:59:57.2209618Z java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
2020-02-07T15:59:57.2209725Z at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
2020-02-07T15:59:57.2209812Z at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
2020-02-07T15:59:57.2209908Z at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
2020-02-07T15:59:57.2209993Z at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
2020-02-07T15:59:57.2210098Z at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2020-02-07T15:59:57.2210177Z at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2020-02-07T15:59:57.2210274Z at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2020-02-07T15:59:57.2210433Z at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-02-07T15:59:57.2210542Z Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
2020-02-07T15:59:57.2210631Z at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152)
2020-02-07T15:59:57.2210744Z at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
2020-02-07T15:59:57.2210844Z at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379)
2020-02-07T15:59:57.2210945Z at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
2020-02-07T15:59:57.2211019Z ... 7 more
2020-02-07T15:59:57.2211550Z Caused by: org.apache.flink.runtime.client.JobExecutionException: Cannot initialize task 'DataSink (CsvOutputFormat (path: s3://test-data/temp/test_batch_wordcount-0205c494-01da-4cde-ae74-1925833efb57, delimiter: ))': doesBucketExist on test-data: com.amazonaws.SdkClientException: Unable to execute HTTP request: minio: Unable to execute HTTP request: minio
2020-02-07T15:59:57.2211736Z at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:216)
2020-02-07T15:59:57.2211831Z at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:253)
2020-02-07T15:59:57.2211942Z at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:225)
2020-02-07T15:59:57.2212028Z at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:213)
2020-02-07T15:59:57.2212127Z at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:117)
2020-02-07T15:59:57.2212218Z at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
2020-02-07T15:59:57.2212330Z at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
2020-02-07T15:59:57.2212497Z at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
2020-02-07T15:59:57.2212592Z at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
2020-02-07T15:59:57.2212714Z at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
2020-02-07T15:59:57.2212815Z at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
2020-02-07T15:59:57.2212902Z ... 10 more
2020-02-07T15:59:57.2213257Z Caused by: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on test-data: com.amazonaws.SdkClientException: Unable to execute HTTP request: minio: Unable to execute HTTP request: minio
2020-02-07T15:59:57.2213378Z at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:177)
2020-02-07T15:59:57.2213464Z at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
2020-02-07T15:59:57.2213565Z at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:260)
2020-02-07T15:59:57.2213641Z at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
2020-02-07T15:59:57.2213731Z at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
2020-02-07T15:59:57.2213911Z at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
2020-02-07T15:59:57.2214077Z at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:372)
2020-02-07T15:59:57.2214226Z at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:308)
2020-02-07T15:59:57.2214370Z at org.apache.flink.fs.s3.common.AbstractS3FileSystemFactory.create(AbstractS3FileSystemFactory.java:126)
2020-02-07T15:59:57.2214473Z at org.apache.flink.core.fs.PluginFileSystemFactory.create(PluginFileSystemFactory.java:61)
2020-02-07T15:59:57.2214559Z at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:441)
2020-02-07T15:59:57.2214750Z at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:362)
2020-02-07T15:59:57.2214831Z at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
2020-02-07T15:59:57.2214928Z at org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:275)
2020-02-07T15:59:57.2215019Z at org.apache.flink.runtime.jobgraph.InputOutputFormatVertex.initializeOnMaster(InputOutputFormatVertex.java:100)
2020-02-07T15:59:57.2215129Z at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:212)
2020-02-07T15:59:57.2215208Z ... 20 more
2020-02-07T15:59:57.2215291Z Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: minio
2020-02-07T15:59:57.2215377Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1114)
2020-02-07T15:59:57.2215488Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1064)
2020-02-07T15:59:57.2215582Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
2020-02-07T15:59:57.2215693Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
2020-02-07T15:59:57.2215780Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
2020-02-07T15:59:57.2215885Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
2020-02-07T15:59:57.2215972Z at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
2020-02-07T15:59:57.2216070Z at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
2020-02-07T15:59:57.2216154Z at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
2020-02-07T15:59:57.2216251Z at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
2020-02-07T15:59:57.2216338Z at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1337)
2020-02-07T15:59:57.2216441Z at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1277)
2020-02-07T15:59:57.2216613Z at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:373)
2020-02-07T15:59:57.2216709Z at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
2020-02-07T15:59:57.2216774Z ... 34 more
2020-02-07T15:59:57.2216848Z Caused by: java.net.UnknownHostException: minio
2020-02-07T15:59:57.2216918Z at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
2020-02-07T15:59:57.2217006Z at java.net.InetAddress.getAllByName(InetAddress.java:1193)
2020-02-07T15:59:57.2217080Z at java.net.InetAddress.getAllByName(InetAddress.java:1127)
2020-02-07T15:59:57.2217168Z at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
2020-02-07T15:59:57.2217250Z at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
2020-02-07T15:59:57.2217362Z at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
2020-02-07T15:59:57.2217468Z at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
2020-02-07T15:59:57.2217564Z at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
2020-02-07T15:59:57.2217656Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-02-07T15:59:57.2217734Z at java.lang.reflect.Method.invoke(Method.java:498)
2020-02-07T15:59:57.2217832Z at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
2020-02-07T15:59:57.2217919Z at com.amazonaws.http.conn.$Proxy33.connect(Unknown Source)
2020-02-07T15:59:57.2218014Z at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
2020-02-07T15:59:57.2218097Z at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
2020-02-07T15:59:57.2218322Z at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
2020-02-07T15:59:57.2218411Z at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
2020-02-07T15:59:57.2218507Z at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
2020-02-07T15:59:57.2218593Z at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
2020-02-07T15:59:57.2218689Z at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
2020-02-07T15:59:57.2218775Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1236)
2020-02-07T15:59:57.2218880Z at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
2020-02-07T15:59:57.2218953Z ... 46 more
{code}
As a user, I would hope to see this exception, or at least a stack trace pointing into this code path, as an indication of what the problem might be.
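For illustration only, here is a minimal sketch of how the innermost cause of such a chain could be extracted and shown to the user. It is not Flink code: the class name {{RootCause}}, the method {{find}}, and the hand-built exception chain (with a plain {{IOException}} standing in for the S3A wrapper) are assumptions made for this example. It relies only on the standard {{Throwable.getCause()}} API.

```java
import java.io.IOException;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Flink: walks a cause chain to its innermost throwable.
final class RootCause {

    /** Follows getCause() links until a throwable without a cause is reached. */
    static Throwable find(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built chain that loosely mimics the nesting in the log above.
        Throwable failure = new RuntimeException(
                "Could not set up JobManager",
                new IOException(
                        "doesBucketExist on test-data",
                        new UnknownHostException("minio")));

        Throwable root = find(failure);
        // Prints: java.net.UnknownHostException: minio
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Reporting even just this innermost line alongside the submission failure would point at the unresolvable {{minio}} host rather than at a timeout.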
> Improve error reporting when submitting batch job (instead of AskTimeoutException)
> ----------------------------------------------------------------------------------
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.11.0
> Reporter: Robert Metzger
> Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit test, I noticed that job submission does not produce very helpful error messages.
> Environment:
> - A simple batch wordcount job
> - an unavailable minio S3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with an AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-939201095]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
> 2020-02-07T11:38:27.1189538Z at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)