Posted to user@flink.apache.org by "Chan, Regina" <Re...@gs.com> on 2017/12/12 06:55:37 UTC

ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Hi,

I'm currently submitting 50 separate jobs to a setup with 50 TaskManagers (TMs) of 1 slot each. Each job has a parallelism of 1. There's plenty of space left in my cluster and on that node, so it's not clear to me what's happening. Any pointers?

On the client side, when I try to execute, I see the following:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Could not upload the jar files to the job manager.
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
        at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
        at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
        at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.execute(FlowData.java:143)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.flowPartialIngestionHalf(FlowData.java:107)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:72)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:39)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Could not upload the jar files to the job manager.
        at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:150)
        at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Could not retrieve the JobManager's blob port.
        at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:745)
        at org.apache.flink.runtime.jobgraph.JobGraph.uploadUserJars(JobGraph.java:565)
        at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:148)
        ... 9 more
Caused by: java.io.IOException: PUT operation failed: Connection reset
        at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:512)
        at org.apache.flink.runtime.blob.BlobClient.put(BlobClient.java:374)
        at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:771)
        at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:740)
        ... 11 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:499)
        ... 14 more


On the job manager logs I see this:

2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
        at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
        at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
        at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device




Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 | (212) 902-5697


Re: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Posted by Nico Kruber <ni...@data-artisans.com>.
Hi Regina,
judging from the exception you posted, this is not about storing the
file in HDFS, but about a step before that, where the BlobServer first
puts the incoming file into its local file system in the directory given
by the `blob.storage.directory` configuration property. If this property
is not set or empty, it falls back to `java.io.tmpdir`. The BlobServer
creates a subdirectory `blobStore-<UUID>` and puts incoming files into
`<storage-dir>/blobStore-<UUID>/incoming` with file names like
`temp-12345678` (using an atomic file counter). It seems that there is
no space left in the file system that holds this directory.
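
To check this on the JobManager host, something like the following Python
sketch may help. It is only an illustration, not part of Flink: the helper
names `storage_report` and `can_hold` are made up, and `/tmp` is an assumed
value for `java.io.tmpdir` (the usual JVM default on Linux). It reports free
inodes as well as free bytes, because a file system that has run out of
inodes also fails writes with "No space left on device".

```python
import os
import shutil


def storage_report(path):
    """Return free bytes and free inodes for the file system holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    stats = os.statvfs(path)         # POSIX only
    return {
        "path": path,
        "free_bytes": usage.free,
        "total_inodes": stats.f_files,   # 0 on file systems that do not report inodes
        "free_inodes": stats.f_favail,   # inodes available to unprivileged users
    }


def can_hold(path, incoming_bytes):
    """True if `incoming_bytes` fit and the file system still has a free inode."""
    report = storage_report(path)
    # Some file systems report 0 total inodes; treat that as "no inode limit known".
    inodes_ok = report["total_inodes"] == 0 or report["free_inodes"] > 0
    return report["free_bytes"] >= incoming_bytes and inodes_ok


if __name__ == "__main__":
    # Assumption: blob.storage.directory is unset, so the JVM's java.io.tmpdir
    # applies; /tmp is the common Linux default, adjust to your configuration.
    blob_dir = "/tmp"
    print(storage_report(blob_dir))
```

Run it as the same user and on the same machine as the JobManager, since a
container or a different node may see a different (and much smaller) local
file system than the one you checked by hand.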

If you set the log level to INFO, you should see a message like "Created
BLOB server storage directory ..." with the path. Can you double check
whether there is really no space left there?
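
If the JobManager log is large, a small script can pull that path out for
you. This is a rough sketch under assumptions: the function name
`find_blob_dirs` is made up, and the pattern assumes the path follows the
message text directly and contains no spaces. Adjust it to what your log
actually shows.

```python
import re

# Assumed layout: "... Created BLOB server storage directory <path>"
PATTERN = re.compile(r"Created BLOB server storage directory\s+(\S+)")


def find_blob_dirs(log_lines):
    """Return every storage directory path announced in the given log lines."""
    dirs = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            dirs.append(match.group(1))
    return dirs


if __name__ == "__main__":
    # Made-up sample line for illustration only; not taken from a real log.
    sample = [
        "INFO  BlobServer - Created BLOB server storage directory /tmp/blobStore-example",
        "INFO  BlobServer - Started BLOB server",
    ]
    print(find_blob_dirs(sample))
```

To use it on a real log, pass an open file object, for example
`find_blob_dirs(open("jobmanager.log"))`, where the file name is a
placeholder for your actual JobManager log.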


Nico



RE: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Posted by "Chan, Regina" <Re...@gs.com>.
And if it helps, I'm running on Flink 1.2.1. I saw this ticket: https://issues.apache.org/jira/browse/FLINK-5828. It only started happening when I was running all 50 flows at the same time. However, it looks like the issue is not with creating the cache directory but with running out of space there? But what's in there is also tiny.

bash-4.1$ hdfs dfs -du -h hdfs://d191291/user/delp/.flink/application_1510733430616_2098853
1.1 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/5c71e4b6-2567-4d34-98dc-73b29c502736-taskmanager-conf.yaml
1.4 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-conf.yaml
93.5 M   hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-dist_2.10-1.2.1.jar
264.8 M  hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/lib
1.9 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/log4j.properties

