Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/05/02 14:02:04 UTC

[jira] [Commented] (FLINK-6020) Blob Server cannot handle multiple job submits (with same content) parallelly

    [ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992954#comment-15992954 ] 

ASF GitHub Bot commented on FLINK-6020:
---------------------------------------

Github user netguy204 commented on the issue:

    https://github.com/apache/flink/pull/3525
  
    +1 I'm looking forward to this fix as I think I'm encountering this bug in production.
    
    I bundle my jobs into a single JAR file with multiple mains. I submit the jobs to the cluster sequentially (once the cluster accepts one I submit the next). My job also has two dependency JARs that I provide via HTTP using the -C switch to flink.
    
    When a job fails it restarts automatically, but this seems to cause other jobs from the same JAR to fail and restart as well. The error is always some variation of:
    
    ```
    java.lang.IllegalStateException: zip file closed
    	at java.util.zip.ZipFile.ensureOpen(ZipFile.java:669)
    	at java.util.zip.ZipFile.getEntry(ZipFile.java:309)
    	at java.util.jar.JarFile.getEntry(JarFile.java:240)
    	at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128)
    	at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
    	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
    	at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983)
    	at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
    	at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
    	at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
    	at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
    	at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
            .... backtrace from some arbitrary point in my code that never is doing anything with reflection ...
    ```
    
    The class load that triggers the fault is arbitrary. The same job may fail and restart multiple times in the same day with a different failing class load.
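
For readers unfamiliar with this exception: at the JDK level, "zip file closed" is thrown when an entry is looked up on a JarFile or ZipFile handle that has already been closed. The sketch below reproduces only that mechanism with a throwaway JAR. The class and method names are hypothetical, and it does not establish which component closes the JAR in the scenario reported above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

/** Hypothetical demo class; not part of Flink. */
public class ClosedJarDemo {

    /** Builds a tiny JAR, closes the handle, then tries to look up an entry. */
    static String lookupAfterClose() throws IOException {
        Path jar = Files.createTempFile("demo", ".jar");
        try {
            // Write a JAR containing a single small entry.
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jos = new JarOutputStream(out)) {
                jos.putNextEntry(new JarEntry("hello.txt"));
                jos.write("hi".getBytes(StandardCharsets.UTF_8));
                jos.closeEntry();
            }

            JarFile jarFile = new JarFile(jar.toFile());
            jarFile.close(); // simulates some other user of the handle closing it

            try {
                jarFile.getEntry("hello.txt"); // lookup on a closed handle
                return "no exception";
            } catch (IllegalStateException e) {
                return e.getMessage();
            }
        } finally {
            Files.deleteIfExists(jar);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lookupAfterClose());
    }
}
```

A class loader that still holds such a closed handle would fail in the same way on whichever class or resource it happens to load next, which is consistent with the failing class load looking arbitrary.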


> Blob Server cannot handle multiple job submits (with same content) parallelly
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-6020
>                 URL: https://issues.apache.org/jira/browse/FLINK-6020
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>            Reporter: Tao Wang
>            Assignee: Tao Wang
>            Priority: Critical
>
> In yarn-cluster mode, if we submit the same job multiple times in parallel, the tasks will encounter class loading problems and lease occupation.
> This is because the blob server stores user jars under a name generated from their sha1sum: it first writes a temp file and then moves it to the final location. For recovery it also puts them on HDFS under the same file name.
> At the same time, when multiple clients submit the same job with the same jar, the local jar files in the blob server and the files on HDFS are handled by multiple threads (BlobServerConnection) and interfere with each other.
> It would be better to have a way to handle this. Two ideas come to mind:
> 1. lock the write operation, or
> 2. use some unique identifier as the file name instead of (or in addition to) the sha1sum of the file contents.
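
The two ideas above can be sketched together as follows. This is a minimal illustration under stated assumptions, not Flink's BlobServer code and not the fix in the pull request: SketchBlobStore, put, the "blob_" and "incoming_" name prefixes, and the per-key lock map are all hypothetical. Each upload writes to its own uniquely named temp file, and only the final move to the sha1-derived name is serialized per key.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical content-addressed store; an illustration only, not Flink's BlobServer. */
public class SketchBlobStore {
    private final Path dir;
    // One lock object per content key, so uploads of different blobs do not block each other.
    // (Assumption for brevity: entries are never evicted from this map.)
    private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();

    public SketchBlobStore(Path dir) {
        this.dir = dir;
    }

    /** Hex-encoded SHA-1 of the content, used as the storage key. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stores the content and returns its final path; safe to call concurrently. */
    public Path put(byte[] data) throws IOException {
        String key = sha1Hex(data);
        Path target = dir.resolve("blob_" + key);
        // Idea 2: every upload gets its own temp file, so writers never share a file.
        Path tmp = Files.createTempFile(dir, "incoming_", ".tmp");
        try {
            Files.write(tmp, data);
            // Idea 1: serialize only the finalize step for identical content.
            synchronized (locks.computeIfAbsent(key, k -> new Object())) {
                if (!Files.exists(target)) {
                    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                }
            }
        } finally {
            // Removes the temp file if another upload already finalized this key.
            Files.deleteIfExists(tmp);
        }
        return target;
    }
}
```

In this sketch, a second upload of identical content leaves the already finalized file untouched, so a reader holding it open never sees it replaced. The same reasoning would still have to be applied to the HDFS copy mentioned in the description, which this sketch does not cover.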



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)